1. Introduction

1.1 Background

In 2018, the Capital Metropolitan Transportation Authority (CapMetro), a public transportation agency serving Austin, Travis County, and parts of Williamson County, launched Cap Remap, a bus system redesign project, as part of its transit development plan, Connections 2025. Cap Remap adjusted the transit network according to internal analysis and community outreach and aimed to provide a more frequent, more reliable, and better-connected bus system. Specifically, it remapped certain routes, tripled the number of bus routes that operate every 15 minutes, and increased weekend frequency to match demand. This project provides an opportunity to understand what factors influence bus ridership.

1.2 Use Case

Given the renewed interest in bus transit in US cities such as Austin, there is an opportunity to streamline the bus planning process using modern data science methods. Currently, cities have to gather information on land use, the built environment, demographics, and more from different sources to understand how bus ridership may change in the future. This process is usually time consuming and labor intensive. Oftentimes, cities have to outsource these analyses to third parties, which inevitably raises project costs. The goal of this article, therefore, is to present a scenario planning tool that lets planners test how changes in local land uses and bus route characteristics predict bus ridership. If such a predictive model proves robust, planners can use it to evaluate a range of possible futures for land use development and bus route changes in Austin and make strategic decisions efficiently. This report is broken into four further sections. Section 2 presents an exploratory analysis of Cap Remap to understand the trends, patterns, and characteristics of ridership in Austin, which helps determine the important features to incorporate in the predictive model. Section 3 explains the process of model building and model evaluation. Section 4 demonstrates the user interface of the bus network planning application, which is supported by the model developed in Section 3. The last section is an appendix showing the code and additional information about the model and application development.

1.3 What Data Are We Using?

The ridership data we used come from the Automated Passenger Counter (APC) system, which counts boardings and alightings on any given bus. The image below illustrates the APC system at work.

Thanks to the Cap Remap project and the ridership and bus system data collected for it, we were able to obtain average weekday daily ridership for 2019 and use it as the dependent variable.

The dataset also provides information on route characteristics, such as route types and high-ridership lines, which we call hotlines.

From multiple open data platforms, we were also able to retrieve built environment data, such as building areas and amenities.

Finally, the US Census provides comprehensive demographic data, such as population, vehicle ownership, and median income.

2. Exploratory Analysis

Before diving into the model building, it is crucial to have a good grasp of the characteristics and anatomy of bus ridership in Austin in order to construct a useful ridership prediction application for planners. This section investigates Austin's ridership data from the APC and answers the following questions: How did ridership change before and after the implementation of Cap Remap (June 3, 2018)? How does ridership vary across the city? What route characteristics influence ridership? What are the popular bus routes in Austin, and what are their attributes?

2.1 Annual Citywide Ridership Trend

How did ridership change before and after Cap Remap (June 3, 2018)? Did it increase after the redesign?

Currently available data from Capital Metro allow us to observe the trend in ridership before and after Cap Remap. The first important part of the exploratory analysis is the citywide change in ridership brought by Cap Remap. Using stop-level data from January 2017 to November 2019, the aggregated citywide ridership trend is shown in the chart below.

The x-axis represents the month, and the y-axis represents the average daily ridership in that month. The yellow, blue, and red lines represent monthly ridership in 2017, 2018, and 2019, respectively. The vertical line marks June, the month Cap Remap rolled out in 2018.

Ridership clearly fluctuated from month to month. 2017 and 2018 display a similar trend, with a spike in ridership in June. 2019, however, shows a smoother curve, and ridership actually decreased in that month; we suspect an issue with data retrieval and will request updated data. Ridership is generally higher in fall and winter in all three years. Our hypothesis is that this relates to school and university holidays and class schedules. If we set aside the June data and compare the trend across the three years, we do see a generally increasing trend, with each year's ridership higher than the last, which demonstrates the effect of Cap Remap.

2.2 Ridership Pattern in Subdivisions

After establishing the citywide trend, the next question is how ridership changed across the city: which areas experienced increases and which experienced decreases. We first look at ridership patterns by general typology to understand the overall trend, and then use Austin's neighborhoods to show the spatial pattern.

We first plotted the average ridership by typology. The UT and CBD regions have higher ridership than the rest of the city.

Then we looked at ridership by neighborhood. In the map, darker blue represents a larger ridership increase, while darker red represents a smaller increase or even a decrease. Mostly downtown areas experienced ridership increases from June to September, while the outskirts of Austin experienced small increases or decreases.

We then created charts showing the ridership change in each neighborhood between June and September 2018. Twelve neighborhoods experienced a ridership decrease over that period, while several experienced increases of more than 10,000. Generally, most neighborhoods saw ridership grow after Cap Remap from June to September. Among Austin's 78 neighborhoods, we identified three that represent distinct patterns: a neighborhood with an expected ridership increase, one with an unexpected ridership increase, and one with an unexpected ridership decrease.

UT is the neighborhood with the expected ridership increase. The UT neighborhood sits just above the downtown neighborhood. With many university students living in the area, the bus network is sensitive to the school schedule, and there is a relatively clear pattern of ridership change following the academic calendar.

The second neighborhood, Govalle, experienced an unexpected ridership increase. After Cap Remap, ridership in Govalle grew by roughly 50% to 75%. As Govalle is closer to the outskirts of Austin, this increase might reflect Cap Remap's success in strengthening east-west connections.

But some neighborhoods along the east-west direction experienced ridership decreases. Zilker, located southwest of Austin's downtown region, saw a gradual, slight decrease in ridership after Cap Remap.

2.3 Route Analysis

2.3.1 Route Type

What could potentially influence ridership in terms of route information?

There are several route types, each serving a different purpose. Our hypothesis is that route type plays an important role in determining ridership.

The Austin bus system comprises nine types of routes; the graphs below show the six main types. Because Capital Metro is a regional transit agency, its service area extends beyond the City of Austin. Since our analysis and model building focus on Austin, the basemap below outlines only the City of Austin.

Regarding route types, the characteristics are listed below:

Local: Capital Metro’s Local routes are intended to connect specific neighborhoods of Austin to Downtown Austin, with frequent stops.

MetroRapid: Capital Metro’s MetroRapid routes are a bus rapid transit-style service serving high-traffic corridors, running every 15 minutes on weekdays and every 10 minutes at rush hour.

UT Shuttle: The UT Shuttle system includes a number of routes during the University of Texas semester. They do not operate on Saturdays, except during finals.

Crosstown: Capital Metro’s Crosstown routes are local services between two neighborhoods of Austin, for which the route does not pass through Downtown Austin or the University of Texas.

Limited & Flyer: Capital Metro’s Limited and Flyer routes are limited stop services between two destinations. Limited routes tend to have fewer stops compared to their local counterparts, while Flyer routes serve nonstop between downtown or the UT campus and their neighborhoods of service.

Feeder: Capital Metro’s Feeder routes are local services between a neighborhood and a major transfer point for connecting service.

2.3.2 Hotlines

What makes a good bus system? What’s so special about the ‘hotlines’?

The following analysis aims to find out which routes are popular, why they are popular, and how they have changed, from a micro perspective. K-means cluster analysis was used to separate the disaggregated data into groups. K-means is an unsupervised learning algorithm that groups observations based on the distribution of each feature. We use this algorithm to see whether the resulting grouping identifies the hotlines, i.e., the routes with higher ridership.
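As a minimal sketch of this step, assuming a route-level summary built from the disaggregated APC data (the feature set and the two-cluster choice here are illustrative, not the exact configuration we used):

```r
library(dplyr)

# Hypothetical route-level features summarized from the disaggregated data
route_stats <- disagg %>%
  group_by(ROUTE) %>%
  summarize(avg_on   = mean(PSGR_ON,   na.rm = TRUE),
            avg_load = mean(PSGR_LOAD, na.rm = TRUE))

# Two clusters: "hot" routes vs. the rest; scale() puts features on one scale
set.seed(1234)
km <- kmeans(scale(route_stats[, c("avg_on", "avg_load")]), centers = 2)
route_stats$cluster <- km$cluster

# Label the cluster with the higher mean ridership as the hotlines
hot <- which.max(tapply(route_stats$avg_on, route_stats$cluster, mean))
hotlines <- route_stats$ROUTE[route_stats$cluster == hot]
```

Because k-means is sensitive to initialization, setting a seed (and optionally `nstart`) keeps the hotline labels reproducible across runs.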

We ran the k-means analysis both before and after Cap Remap: 4 lines are labeled as hotlines before the remap, and 6 after. The hotlines before and after the remap are plotted below. Most of the hot routes run north-south. Two new hotlines, Line 10 and Line 20, emerged after Cap Remap; they are colored in red.

To dive deeper into the characteristics of the hot bus lines, we mapped the passenger load for each route at each stop in each direction. We also plotted passenger load against stop sequence ID, as well as average boardings and alightings at each stop along each route. The purpose of this analysis is, first, to find out what is special about the hotlines and, second, to see trends before and after Cap Remap. Note that each route in the Austin bus system has multiple patterns; to keep the plots legible, we selected only the most used pattern for each plot. Below we chose Line 20 (High Frequency) and Line 801 (MetroRapid) to demonstrate the detailed route analysis.

Below is the analysis for Line 801.

By mapping and plotting the average number of passengers on board, as well as the average boardings and alightings at each stop, we can better see how specific locations or neighborhoods potentially contribute to ridership. These regions will be feature engineered in the following analysis. We also noticed that ridership tends to be higher in the middle portion of the trip, meaning many passengers board at early stops and ride to stops near the end.

In conclusion, hotlines have the following characteristics:

  • In terms of bus route types, Local, MetroRapid, and High Frequency routes have high ridership
  • In terms of geographical distribution:
    • Go through Hubs (UT, DT, Pleasant Valley)
    • Mostly North-South direction (Following the shape / geography of the city)
    • Crossing a large portion of the city
  • In terms of temporal trend, we know that more shifts were added in the daytime and at rush hours, which might increase ridership.

3. Modeling

3.1 Strategies

We will be creating a machine learning model that predicts ridership at each stop. This model will allow planners to test different scenarios in which a large development, a land use change, or a route frequency change could substantially impact system ridership. To make the prediction model more accurate and generalizable, we will compare how linear, lasso and ridge regression, random forest, and XGBoost models capture the variability in our dataset. We will start with feature engineering, which consists of 5 major categories: amenities, built environment, demographics, route network, and stop characteristics (internal data). The hypothesis is that these five categories influence ridership at each stop in different ways. The dependent variable is the average ridership at each stop in 2019. We use 2019 data because we want to focus on the period after Cap Remap, and our feature engineering aligns better with 2019.

3.2 Feature Engineering

The feature table below demonstrates five types of features and the sources of each feature. All data comes from the following sources: APC aggregated and disaggregated data, Capital Metro, OpenStreetMap (OSM), Open Data Austin, and ACS Census.

In the amenity category, information about where amenities are located is collected from OSM. Examples of amenities include stadiums, supermarkets, offices, and train stations. For amenities, we created buffers around each stop of 1/2 mile, 1/4 mile, and 1/8 mile, and counted the number of each type of amenity within each buffer. To capture the distance relationship between stops and amenities, the distance from each stop to the 3 closest amenities is calculated as well.
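A sketch of these buffer features with sf, assuming an amenity point layer `amenities_sf` and a column `amenity_type` (both illustrative names; the stop layer `agg_sf` appears in the appendix):

```r
library(sf)
library(dplyr)

# agg_sf and amenities_sf are point layers in EPSG:2278 (units: feet);
# a 1/4-mile buffer is 1320 feet in this projected CRS
stop_buffers <- st_buffer(agg_sf, dist = 1320)

# Count each amenity type falling inside each stop's buffer
amenity_counts <- st_join(stop_buffers, amenities_sf, join = st_contains) %>%
  st_drop_geometry() %>%
  count(STOP_ID, amenity_type)

# Mean distance from each stop to its 3 nearest amenities
d <- units::drop_units(st_distance(agg_sf, amenities_sf))
agg_sf$amenity_dist3 <- apply(d, 1, function(x) mean(sort(x)[1:3]))
```

Swapping `dist = 1320` for 2640 or 660 feet reproduces the 1/2-mile and 1/8-mile variants.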

In the built environment category, land use types, building area, neighborhood fixed effects, and school district fixed effects are included. Some features, such as neighborhood and school district, are spatially joined to the stops. For others, such as land use and building area, the percentage of each land use type and the total building area within the buffer are calculated. Note that all three buffer sizes are tested here to capture more variation in the dataset.

In the demographics category, data on population, median income, and car ownership are collected. For demographics, we used areal weighted interpolation to join the weighted census estimates within each buffer to its stop.
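sf's `st_interpolate_aw` implements areal weighted interpolation; a sketch, assuming a tract polygon layer `tracts` and the illustrative column names below:

```r
library(sf)

# tracts: census tract polygons with ACS estimates;
# stop_buffers: 1/4-mile buffers around the stops.
# Counts (population, vehicles) are extensive -- they split
# proportionally with overlapping area; medians (income) are intensive.
pop_at_stops    <- st_interpolate_aw(tracts["population"], stop_buffers,
                                     extensive = TRUE)
income_at_stops <- st_interpolate_aw(tracts["med_income"], stop_buffers,
                                     extensive = FALSE)
```

Getting the extensive/intensive flag right matters: treating a median as extensive would sum fractions of medians into a meaningless total.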

For route information, the percentage of each route type passing each bus stop is calculated, along with the number of shifts passing through each stop in a given week.

For the internal (stop characteristics) category, we first added a transit hub dummy defined by Capital Metro; we then calculated the spatial lag, which is the average ridership of the surrounding stops.
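The spatial lag can be sketched with spdep, here defining "surrounding stops" as the k = 5 nearest (k is an assumption; `agg_sf19` with its `avg_on` column appears in the appendix):

```r
library(sf)
library(spdep)

coords <- st_coordinates(agg_sf19)          # stop point coordinates
nb  <- knn2nb(knearneigh(coords, k = 5))    # 5 nearest stops per stop
wts <- nb2listw(nb, style = "W")            # row-standardized weights

# Spatial lag: weighted average ridership of each stop's 5 neighbors
agg_sf19$ridership_lag <- lag.listw(wts, agg_sf19$avg_on)
```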

A series of analyses was conducted to identify important features. We first looked at the correlation between each feature and the dependent variable, mean ridership in 2019. Below are selected features that correlate strongly with ridership, positively and negatively. We found that route information, amenity distances, and stop characteristics often correlate strongly with ridership.

A correlation matrix was made to check for potential collinearity among the features and the dependent variable. We found that certain features correlate strongly with each other, such as the route direction features SouthNorth and WestEast, the land use features commercial and residential, and the amenity features distance to CBD and distance to train stations. It is important to account for these relationships when selecting features for the model.
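A sketch of this collinearity screen (the 0.8 cutoff and the `features` data frame are illustrative):

```r
library(dplyr)

# Pairwise correlations among the numeric features
num_feats <- features %>% select(where(is.numeric))
cormat <- cor(num_feats, use = "pairwise.complete.obs")

# List feature pairs whose absolute correlation exceeds the cutoff
idx <- which(abs(cormat) > 0.8 & upper.tri(cormat), arr.ind = TRUE)
data.frame(var1 = rownames(cormat)[idx[, 1]],
           var2 = colnames(cormat)[idx[, 2]],
           r    = cormat[idx])
```

Restricting to `upper.tri` avoids reporting each pair twice and excludes the trivial diagonal.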

3.3 Results and Validation

As the exploratory analysis showed, neighborhood has a large impact on ridership; similarly, a typology dividing downtown and UT from the rest of Austin captures much of the difference in ridership patterns, driven by the city's built environment and school schedules. Thus, in the model validation section, neighborhood and city typology (downtown, UT Austin, and the rest) are used to test the models' generalizability.

With the features created in the five categories, four types of models are built: a simple linear regression model (lm in the following visualizations), a lasso and ridge regression model (glmnet), a random forest model (rf), and an XGBoost model (xgb). They will be tested and validated through a generalizability test to see which model fits best.

Apart from the original 1/4-mile buffer, two more buffer sizes were created during feature engineering. The dataset for each buffer size will be run through the same generalizability test, and the best buffer size will be selected.

3.3.1 Selecting the Best Model

We tested the generalizability of the four models mentioned above by holding out one neighborhood at a time, training on the remaining data, and computing the prediction error on the held-out neighborhood.
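The hold-out loop can be sketched as follows (the data frame, neighborhood column, and formula are illustrative, shown here for the random forest variant):

```r
library(randomForest)

# Leave-one-neighborhood-out cross-validation
logo <- lapply(unique(features$nhood), function(hood) {
  train <- subset(features, nhood != hood)
  test  <- subset(features, nhood == hood)
  fit   <- randomForest(avg_on ~ . - nhood, data = train)
  pred  <- predict(fit, test)
  data.frame(nhood = hood,
             MAE  = mean(abs(pred - test$avg_on)),
             MAPE = mean(abs(pred - test$avg_on) / test$avg_on))
})
errors_by_nhood <- do.call(rbind, logo)
```

A model that generalizes well should show comparable MAE and MAPE across neighborhoods, not just a low overall average.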

The MAPE and MAE of the four models reveal that, generally speaking, the random forest model is the most accurate, with the lowest MAPE and MAE, while the lasso and ridge model is the least accurate, with the highest MAPE and MAE.

In terms of predicted versus actual values, the simple linear regression tends to underpredict when actual ridership exceeds 300. The lasso and ridge model generally overpredicts. The random forest model tends to overpredict when actual ridership exceeds 250. The XGBoost model generally overpredicts but performs better than the glmnet model when actual ridership is low.

To test the generalizability of the models across neighborhoods, the following bar chart shows each model's MAE by neighborhood. Most MAEs are below 100, but several neighborhoods, mostly clustered around UT, have higher MAEs.

To take a closer look at the neighborhoods, maps of MAPE are plotted to show generalizability. As mentioned before, the models' accuracy is lower in the neighborhoods around the University of Texas.

3.3.2 Selecting the Best Buffer Size

It is hard to decide a priori which buffer size captures the most variation in the dataset. As bus stops are relatively densely distributed in Austin, we start from 1/2 mile and gradually reduce the size to 1/4 mile and 1/8 mile. By comparing the R-squared, RMSE, MAE, and MAPE of each model at each buffer size, and further testing the models' generalizability, we select the best buffer size.

1/2 mile Buffer

For each buffer size, the four types of models are compared, and the best-performing model enters the final comparison. In the 1/2-mile buffer test, the random forest model has the largest R-squared (0.81) and the lowest MAE (64) and MAPE (30.4%). Among the four models, this rf model performs best.

1/4 mile Buffer

In the 1/4-mile buffer test, the random forest model has the largest R-squared (0.79) and the lowest MAE (69.3) and MAPE (28.8%). This rf model again performs best among the four.

1/8 mile Buffer

In the 1/8-mile buffer test, the random forest model has the largest R-squared (0.72) and the lowest MAE (77.8) and MAPE (33.6%). This rf model again performs best among the four.

After the above comparison, the largest R-squared comes from the 1/2-mile buffer, while the lowest MAPE comes from the 1/4-mile buffer. A likely reason the 1/8-mile buffer loses is that it cannot capture enough variation within its limited radius; especially in high-ridership areas such as the CBD or UT, much of the variance lies outside 1/8 mile of the stop.

Looking at the generalizability of the three buffer sizes in the CBD, UT, and the rest of the city, the 1/8-mile buffer clearly generalizes worse than the other two: its MAE in the CBD and UT is very high. Between the 1/2-mile and 1/4-mile buffers, the MAE shows slightly better generalizability for the 1/2-mile buffer, given its smaller value in UT. However, considering our use case, in which planners test specific developments at specific locations, a smaller buffer more accurately reflects real-world development. Thus, the final buffer radius is set to 1/4 mile.

4. Application

(The final model will be built with the random forest algorithm, but for now we showcase results from linear regression.)

4.1 What is the impact on transit ridership with new real estate development?

4.1.1 Improving the Model and Checking Generalizability

We want to build a model with high predictive power with respect to building density change. Previously, our model included all the engineered features, but including every feature can dilute the predictive power of individual features. We therefore subsetted a collection of features that gives the highest accuracy while keeping the building density feature influential. The original model with all features has an R-squared of 0.7885, but the building density feature is not significant in it. Even though our current model has an R-squared of only 0.6927, the building density feature is now significant.
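A sketch of this reduced specification (the retained predictor names are illustrative, not the exact subset we used):

```r
# Full model: building density competes with many correlated features
fit_full <- lm(avg_on ~ ., data = features)

# Reduced model: keep building density plus a small set of strong predictors
fit_sub <- lm(avg_on ~ bldg_area + pct_commercial + shifts_per_week +
                dist_to_CBD + ridership_lag, data = features)

summary(fit_sub)$r.squared      # compare against summary(fit_full)$r.squared
summary(fit_sub)$coefficients   # check the significance of bldg_area
```

The trade-off is deliberate: some overall fit is sacrificed so the lever the scenarios pull on, building density, has a stable, significant coefficient.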

To check the predicting accuracy of the model, we looked at the predicting error by neighborhood.

In terms of error percentage, our model is already fairly good, with low error percentages across the whole city. In terms of absolute prediction errors, certain regions are prone to higher errors, which we suspect relates to those regions' total ridership. Switching to the random forest model in the future should improve accuracy.

4.1.2 Scenario 1: What is the effect of UT Austin West Campus Upzone on the bus ridership?

In May 2019, the West Campus neighborhood, just west of UT Austin, passed an upzoning change that allows buildings in the area to increase their heights. We assumed that the existing buildings all build out to the maximum heights.
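Mechanically, the scenario is a counterfactual prediction: scale up the building-area feature inside West Campus and re-predict. A sketch, assuming a fitted model `fit_sub` and illustrative column names and scale factor:

```r
scenario <- features
in_wc <- scenario$nhood == "West Campus"

# Assumption: existing buildings build out to the new maximum height,
# approximated here as a 1.5x increase in building area
scenario$bldg_area[in_wc] <- scenario$bldg_area[in_wc] * 1.5

scenario$pred_base <- predict(fit_sub, features)   # baseline prediction
scenario$pred_new  <- predict(fit_sub, scenario)   # scenario prediction
scenario$delta     <- scenario$pred_new - scenario$pred_base
```

Mapping `delta` per stop yields the scenario maps discussed below; the same pattern drives the East Riverside and Mueller scenarios.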

As shown on the maps, ridership is predicted to change in the West Campus area. However, in the CBD, ridership dropped drastically, while ridership across the rest of the city barely moved. The model thus reflects how an increase in building area influences ridership in an area, yet it still needs improvement for system-wide prediction. This issue also appears in the next two scenarios.

4.1.3 Scenario 2: What is the effect of East Riverside Development on the bus ridership?

The 4700 East Riverside Drive development is a 97-acre mixed-use development that got a green light from the city council in October 2019. It will include 4,709 multifamily units, 600 hotel rooms, 4 million square feet of office space, and 435,000 square feet of ground-floor commercial space.

In this scenario, predicted ridership again surged in the development area, and the CBD also experiences a ridership increase. However, the predicted ridership change over the entire city remains a subject for further examination.

4.1.4 Scenario 3: What is the effect of Mueller Austin Housing Project on the bus ridership?

In the last scenario, we chose the Mueller Austin development, a 711-acre planned, transit-oriented, mixed-use community northeast of the CBD. The project comprises 4.2 million square feet of non-residential development, 650,000 square feet of retail space, 4,600 homes, and 140 acres of open space. An estimated 10,000 permanent jobs will have been created within the development by the time it is complete.

The outcome of this model shows a similar result to the previous two scenarios: ridership increases mainly in the development area yet seems to drag down ridership in other parts of the city, leading to a decrease in overall ridership. There almost seems to be a trade-off between ridership in different areas.

4.2 Web Application

We developed a web application according to our use case: helping planners envision the potential change in transit ridership with respect to changes in real estate. The following screenshots demonstrate the prototype. The first three show how users can investigate current ridership patterns, and the last shows the ridership projections the app produces when the user selects a particular scenario.

5. Appendix

Setup

######### Set Up Functions and Plotting Options ######### 
mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
}

qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]], 0),
                          c(.01, .2, .4, .6, .8), na.rm = TRUE))
  } else if (rnd == FALSE) {
    # Keep three significant digits when rounding is turned off
    as.character(formatC(quantile(df[[variable]],
                                  c(.01, .2, .4, .6, .8), na.rm = TRUE),
                         digits = 3))
  }
}

q5 <- function(variable) {as.factor(ntile(variable, 5))}

plotTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=1.5),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)
  )
}

Background

Data Structure

# Turn data frames into spatial objects
agg_sf <- agg%>%
  st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326)%>%
  st_transform(2278)

disagg_sf <- disagg%>%
  st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326)%>%
  st_transform(2278)

# We use aggregated data to look at the average ridership on weekdays at individual stops
ggplot()+
  geom_sf(data = subset(serviceArea,NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA)))+
  geom_sf(data = subset(agg_after_sf, STOP_ID == 476), aes(color = "Stop 476"), size = 2, show.legend = "point")+
  scale_colour_manual(values = c("Stop 476" = "darkorange"),
                      guide = guide_legend("Aggregated Data Example"))+
  labs(title = "Aggregated Data Structure",
       subtitle = "Data from Capital Metro")+
  ggrepel::geom_label_repel(
    data = subset(agg_after_sf, STOP_ID == 476),aes(label = "Average Ridership = 33 \n Average Passing Buses = 55", geometry = geometry),
    stat = "sf_coordinates",
    min.segment.length = 3)+mapTheme()

# We use disaggregated data to investigate the average ridership on weekdays on different routes.
disagg_803 <- subset(disagg_sf, ROUTE == 803)%>%
  group_by(STOP_ID)%>%
  summarize(avg_on = mean(PSGR_ON),
            avg_load = mean(PSGR_LOAD))
ggplot()+
  geom_sf(data = subset(serviceArea,NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA)))+
  geom_sf(data = disagg_803, aes(color = "Stops on Route 803"), size = 2, show.legend = "point")+
  scale_colour_manual(values = c("Stops on Route 803" = "darkorange"),
                      guide = guide_legend("Disaggregated Data Example"))+
  labs(title = "Disaggregated Data Structure",
       subtitle = "Data from Capital Metro")+
  geom_label_repel(
    data = subset(disagg_803, STOP_ID == 2606),aes(label = "Average On-board Passengers of Stop 2606 = 11 \n Route Type = Metro Rapid", geometry = geometry),
    stat = "sf_coordinates",
    min.segment.length = 0,
    segment.color = "lightgrey",
    point.padding = 20)+mapTheme()

Route Types Changes

# Crosstown
crosstown <-ggplot()+
  geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
  geom_sf(data = subset(Routes1801, ROUTETYPE == "Crosstown"), color = "greenyellow",lwd = 0.8,show.legend = FALSE)+
  geom_sf(data = subset(Routes2001, ROUTETYPE == "Crosstown"), color = "greenyellow",lwd = 0.8,show.legend = FALSE)+
  facet_grid(~capremap)+
  labs(title = "Crosstown Routes Before and After Cap Remap")+mapTheme()

# Feeder
feeder <-ggplot()+
  geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
  geom_sf(data = subset(Routes1801, ROUTETYPE == "Feeder"), color = "lightcoral",lwd = 0.8, show.legend = FALSE)+
  geom_sf(data = subset(Routes2001, ROUTETYPE == "Feeder"), color = "lightcoral",lwd = 0.8, show.legend = FALSE)+
  facet_grid(~capremap)+
  labs(title = "Feeder Routes Before and After Cap Remap")+mapTheme()


# Flyer
flyer <- ggplot()+
  geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
  geom_sf(data = subset(Routes1801, ROUTETYPE == "Flyer"), color = "magenta2",lwd = 0.8,show.legend = FALSE)+
  geom_sf(data = subset(Routes2001, ROUTETYPE == "Flyer"), color = "magenta2",lwd = 0.8,show.legend = FALSE)+
  facet_grid(~capremap)+
  labs(title = "Flyer Routes Before and After Cap Remap")+mapTheme()

# Express
express <-ggplot()+
  geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
  geom_sf(data = subset(Routes1801, ROUTETYPE == "Express"), color = "red",lwd = 0.8,show.legend = FALSE)+
  geom_sf(data = subset(Routes2001, ROUTETYPE == "Express"), color = "red",lwd = 0.8,show.legend = FALSE)+
  facet_grid(~capremap)+
  labs(title = "Express Routes Before and After Cap Remap")+mapTheme()

# Special
special <- ggplot()+
  geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
  geom_sf(data = subset(Routes1801, ROUTETYPE == "Special"), color = "seashell2",lwd = 0.8,show.legend = FALSE)+
  geom_sf(data = subset(Routes2001, ROUTETYPE == "Special"), color = "seashell2",lwd = 0.8,show.legend = FALSE)+
  facet_grid(~capremap)+
  labs(title = "Special Routes Before and After Cap Remap")+mapTheme()

# route types with minor changes, arranged in a grid

grid.arrange(crosstown, feeder, flyer, express, ncol =2)

Exploratory Analysis

Ridership Typology

#create stop shapefile
agg_sf <- agg%>%
  st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326)%>%
  st_transform(2278)

agg_sf19 <- agg_sf%>%
  filter(YEAR_ID == 2019)%>%
  group_by(STOP_ID)%>%
  summarize(avg_on = mean(AVERAGE_ON))

#Read UT and CBD shapefiles
UT <- st_read("D:/Spring20/Practicum/data/UTAustin/UT.shp")%>%
  st_transform(2278)
CBD <- st_read("D:/Spring20/Practicum/data/CBD/CBD.shp")%>%
  st_transform(2278)

#create shapefile of area outside of CBD
nhood_CBD <- st_difference(nhood_merge, CBD)

# st_difference() didn't work for UT; the layer was created in ArcMap instead
nhood_UT <- st_read("D:/Spring20/Practicum/data/nhood_UT.shp")%>%
  st_as_sf()%>%
  st_transform(2278)

#Create CBD typology
agg_sf19_CBD <- st_join(CBD, agg_sf19, join = st_contains)%>%
  mutate(typology = "CBD")

agg_sf19_oCBD <- st_join(nhood_CBD, agg_sf19, join = st_contains)%>%
  mutate(typology = "oCBD")%>%
  rename(geometry = x)

agg_sf19_oCBD <- agg_sf19_oCBD%>%
  group_by(Id)%>%
  summarize(avg_on = mean(avg_on))%>%
  mutate(label = "The Rest of Austin")

agg_CBD_typology <- rbind(agg_sf19_CBD,agg_sf19_oCBD)

#Create UT typology
agg_sf19_UT <- st_join(UT, agg_sf19, join = st_contains)%>%
  mutate(typology = "UT")

agg_sf19_UT <- agg_sf19_UT%>%
  group_by(Id)%>%
  summarize(avg_on = mean(na.omit(avg_on)))%>%
  mutate(label = "UT Austin")

agg_sf19_oUT <- st_join(nhood_UT, agg_sf19, join = st_contains)%>%
  mutate(typology = "oUT")%>%
  select(STOP_ID,
         avg_on,
         typology,
         geometry)

agg_sf19_oUT <- agg_sf19_oUT%>%
  group_by(Id)%>%
  summarize(avg_on = mean(na.omit(avg_on)))%>%
  mutate(label = "The Rest of Austin")

agg_UT_typology <- rbind(agg_sf19_UT,agg_sf19_oUT)

How did ridership change before and after Cap Remap (June 3, 2018)?

Ridership Change in Different Neighborhoods in Austin in 2018

Identify the Hotlines

First, let us look at the K-means analysis before Cap Remap. We group the disaggregated data by route and calculate the maximum and mean number of passengers on the bus at each stop, the average miles traveled and average hours spent per passenger at each stop, as well as the total run length and total run time of each route.

Then, we apply K-means clustering. The number of clusters is determined by both the elbow chart and the 26 criteria provided by the NbClust package. For more information, see the appendix.

We repeat the same analysis on the disaggregated dataset after Cap Remap.

The cluster labels are then joined back to the original dataset. For more about the clustering results, please see the appendix.

Finding the number of K-means clusters before and after Cap Remap:

Both the elbow chart and the 26 indices provided by the NbClust package are used to determine how many clusters should be used in the K-means analysis.
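As a minimal, self-contained sketch of this check (using `mtcars` as a stand-in for the scaled per-route feature matrix, since the aggregated table is not reproduced here):

```r
set.seed(824)

# Stand-in for the scaled per-route feature matrix described above
route_features <- scale(mtcars[, c("mpg", "wt", "hp")])

# Elbow chart: total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) kmeans(route_features, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")

# Votes across the 26 NbClust indices (guarded in case the package is absent)
if (requireNamespace("NbClust", quietly = TRUE)) {
  nb <- NbClust::NbClust(route_features, distance = "euclidean",
                         min.nc = 2, max.nc = 8, method = "kmeans")
}
```

The "elbow" is the value of k past which the within-cluster sum of squares stops dropping sharply; NbClust reports the k favored by the most indices.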

Before CapRemap:

After CapRemap:

In both cases, the optimal number of clusters is 3. We therefore conduct the K-means analysis with 3 clusters, as mentioned in the exploratory analysis section above.

Here are the K-means results before and after Cap Remap. The numbers are the averages of each feature used in the clustering. Cluster 2 has the highest average ridership and run time both before and after the remapping, and it is also the smallest cluster. We conclude that these are the most popular routes and define them as ‘hotlines’.
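The clustering and hotline-labeling step can be sketched as follows (again with `mtcars` as a stand-in for the per-route summary; in the real data the hotline cluster is the one with the highest average ridership, which also happens to be the smallest):

```r
set.seed(824)

# Stand-in for the scaled per-route feature matrix
route_features <- scale(mtcars[, c("mpg", "wt", "hp")])

# K-means with the 3 clusters chosen above
km <- kmeans(route_features, centers = 3, nstart = 25)

# Join the cluster labels back; flag the smallest cluster as the 'hotline' group
route_summary <- data.frame(route = rownames(mtcars), cluster = km$cluster)
hot_cluster <- as.integer(names(which.min(table(route_summary$cluster))))
route_summary$hotline <- as.integer(route_summary$cluster == hot_cluster)
```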

routeplot1 <- function(n,p,p1,d) {
  # line n before map
  t1 = ggplot() +
  geom_sf(data = nhood, color = 'grey30',fill = 'grey20') +
  geom_sf(data = disagn1j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN== p) %>%
            st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326, agr = "constant") %>%
            st_transform(2278) %>%
            group_by(STOP_ID) %>%
            summarise(mean_stop_load = mean(PSGR_LOAD),size = 0.8), 
          aes(color = mean_stop_load))+
  scale_color_gradientn(colors = c("#0c2c84","#41b6c4", "#ffffcc"), limits = c(0,25), 
                       breaks = c(0, 5, 10, 15, 20, 25)) +
  labs(title=paste("Line",n,"Direction 1, Before CapRemap"),
  subtitle = "Average Number of\nPassengers at Each Stop")+mapTheme()
  
  #line n before passenger load chart
  t11 = ggplot(data = disagn1j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN == p) %>%
         group_by(STOP_SEQ_ID) %>%
         summarise(mean_on = mean(PSGR_ON), mean_off = mean(PSGR_OFF), 
                   mean_load = mean(PSGR_LOAD))) +
    geom_path(aes(x = STOP_SEQ_ID, y = mean_load, 
                size = mean_load, color = mean_load), lineend="round",linejoin="mitre")+
    scale_color_gradientn(colors = c("#0c2c84","#41b6c4", "#ffffcc"), limits = c(0,25), 
                       breaks = c(0, 5, 10, 15, 20, 25))+
    scale_size_continuous()+
    ylim(0, 23) +
    labs(subtitle=paste("Average Passenger Load"))+plotTheme()+ 
    theme(legend.position="none")
  
  #line n before passenger boarding and alighting
  t12 = ggplot(data = disagn1j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN == p) %>%
         group_by(STOP_SEQ_ID) %>%
         summarise(mean_on = mean(PSGR_ON), mean_off = mean(PSGR_OFF), 
                   mean_load = mean(PSGR_LOAD))) +
    geom_ribbon(aes(x = STOP_SEQ_ID,ymin=0,ymax=mean_on), fill="#9999CC", alpha="0.25") +
    geom_ribbon(aes(x = STOP_SEQ_ID,ymin=0,ymax=mean_off), fill="#9999CC", alpha="0.25") +
    geom_line(aes(x = STOP_SEQ_ID, y = mean_on, color = "Average Boarding"), size=1) + 
    geom_line(aes(x = STOP_SEQ_ID, y = mean_off, color = "Average Alighting"), size=1)+ 
    ylim(0, 10) +
    labs(subtitle=paste("Average Boarding/Alighting"))+plotTheme()+ 
    theme(legend.position="bottom", legend.box = "horizontal")
  
  # line n after map
  t2 = ggplot() +
  geom_sf(data = nhood, color = 'grey30',fill = 'grey20') +
  geom_sf(data = disagn2j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN== p1) %>%
            st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326, agr = "constant") %>%
            st_transform(2278) %>%
            group_by(STOP_ID) %>%
            summarise(mean_stop_load = mean(PSGR_LOAD),size = 0.8), 
          aes(color = mean_stop_load))+
  scale_color_gradientn(colors = c("#0c2c84","#41b6c4", "#ffffcc"), limits = c(0,25), 
                       breaks = c(0, 5, 10, 15, 20, 25)) +
  labs(title=paste("Line",n,"Direction 1, After CapRemap"),
  subtitle = "Average Number of\nPassengers at Each Stop")+mapTheme()
  
  #line n after passenger load chart
  t21 = ggplot(data = disagn2j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN == p1) %>%
         group_by(STOP_SEQ_ID) %>%
         summarise(mean_on = mean(PSGR_ON), mean_off = mean(PSGR_OFF), 
                   mean_load = mean(PSGR_LOAD))) +
    geom_path(aes(x = STOP_SEQ_ID, y = mean_load, 
                size = mean_load, color = mean_load), lineend="round",linejoin="mitre")+
    scale_color_gradientn(colors = c("#0c2c84","#41b6c4", "#ffffcc"), limits = c(0,25), 
                       breaks = c(0, 5, 10, 15, 20, 25))+
    scale_size_continuous()+
    ylim(0, 23) +
    labs(subtitle=paste("Average Passenger Load"))+plotTheme()+ 
    theme(legend.position="none")
  
  #line n after passenger boarding and alighting
  t22 = ggplot(data = disagn2j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN == p1) %>%
         group_by(STOP_SEQ_ID) %>%
         summarise(mean_on = mean(PSGR_ON), mean_off = mean(PSGR_OFF), 
                   mean_load = mean(PSGR_LOAD))) +
    geom_ribbon(aes(x = STOP_SEQ_ID,ymin=0,ymax=mean_on), fill="#9999CC", alpha="0.25") +
    geom_ribbon(aes(x = STOP_SEQ_ID,ymin=0,ymax=mean_off), fill="#9999CC", alpha="0.25") +
    geom_line(aes(x = STOP_SEQ_ID, y = mean_on, color = "Average Boarding"), size=1) + 
    geom_line(aes(x = STOP_SEQ_ID, y = mean_off, color = "Average Alighting"), size=1)+ 
    ylim(0, 10) +
    labs(subtitle=paste("Average Boarding/Alighting"))+plotTheme()+ 
    theme(legend.position="bottom", legend.box = "horizontal")
  
  grid.arrange(arrangeGrob(t1, t11, t12, ncol = 1, nrow = 3),
               arrangeGrob(t2, t21, t22, ncol = 1, nrow = 3),ncol=2)
}

Feature Engineering

Amenities (use buffer)

Open Street Map (OSM) Amenity Counts

######### Get OSM Data #########
getOSM <- function(key,value){
  feature <- opq(bbox = 'Austin, Texas')%>%
    add_osm_feature(key = key, value = value) %>%
    osmdata_sf ()
  if(is.null(feature$osm_points)){
    feature_poly <- feature$osm_polygons%>%
      select(osm_id,geometry)%>%
      st_as_sf(coords = geometry, crs = 4326, agr = "constant")%>%
      st_transform(2278)
    return(feature_poly)
  } else {
  feature_pt <- feature$osm_points%>%
    select(osm_id,geometry)%>%
    st_as_sf(coords = geometry, crs = 4326, agr = "constant")%>%
    st_transform(2278)
  return (feature_pt)
  }
}

#commercial
commercial <- getOSM('building', 'commercial')
#retail
retail <- getOSM('building', 'retail')
#supermarket
supermkt <- getOSM('building', 'supermarket')
#office
office <- getOSM('building', 'office')
#residential
residential <- getOSM('building','residential')
#bar
bar <- getOSM('amenity', 'bar')
#school
school <- getOSM('amenity', 'school')
#uni
university <- getOSM('amenity', 'university')
#parking
parking <- getOSM('amenity', 'parking')
#stadium
stadium <- getOSM('building', 'stadium')
#trainstation
trainstation <- getOSM('building', 'train_station')

######### spatial join #########
bufferInit <- function(Buffer, Points, Name){
  # Count amenities within each stop buffer: st_contains for point layers,
  # st_intersects for polygon layers
  if(inherits(Points$geometry, "sfc_POINT")){
    Init <- st_join(Buffer %>% select(STOP_ID), Points, join = st_contains)%>%
      group_by(STOP_ID)%>%
      summarize(count = n())%>%
      rename(!!Name := count)
  } else {
    Init <- st_join(Buffer %>% select(STOP_ID), Points, join = st_intersects)%>%
      group_by(STOP_ID)%>%
      summarize(count = n())%>%
      rename(!!Name := count)
  }
  return(Init)
}

Amenity Distance
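The nearest-amenity distance features can be sketched with sf's nearest-feature tools; the toy layers below are stand-ins for the projected stop and amenity point layers created above, with illustrative coordinates only:

```r
library(sf)

# Stand-in layers in a projected CRS (EPSG:2278 is used throughout this report)
stops_demo <- st_sf(STOP_ID = 1:3,
                    geometry = st_sfc(st_point(c(0, 0)), st_point(c(100, 0)),
                                      st_point(c(0, 200)), crs = 2278))
amenity_demo <- st_sf(osm_id = 1:2,
                      geometry = st_sfc(st_point(c(10, 0)), st_point(c(0, 150)),
                                        crs = 2278))

# Distance from each stop to its nearest amenity
nearest_idx <- st_nearest_feature(stops_demo, amenity_demo)
stops_demo$amenityDist <- as.numeric(
  st_distance(stops_demo, amenity_demo[nearest_idx, ], by_element = TRUE))
```

`st_nearest_feature()` returns, for each stop, the row index of the closest amenity; `by_element = TRUE` then computes one pairwise distance per stop rather than the full distance matrix.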

Built Environments

Demographics

######### census #########
options(tigris_use_cache = TRUE)
v17 <- load_variables(2017, "acs5", cache = TRUE)

Hays <- get_acs(state = "48", county = "209", geography = "tract", 
                variables = "B01001_001", geometry = TRUE)
Travis <- get_acs(state = "48", county = "453", geography = "tract", 
                  variables = "B01001_001", geometry = TRUE)
Williamson <- get_acs(state = "48", county = "491", geography = "tract", 
                      variables = "B01001_001", geometry = TRUE) 

Travis_race <- get_acs(state = "48", county = "453", geography = "tract", 
                       variables = "B02001_002", geometry = TRUE)
Williamson_race <- get_acs(state = "48", county = "491", geography = "tract", 
                           variables = "B02001_002", geometry = TRUE) 

Travis_noveh <- get_acs(state = "48", county = "453", geography = "tract", 
                        variables = "B08014_002", geometry = TRUE)
Williamson_noveh <- get_acs(state = "48", county = "491", geography = "tract", 
                            variables = "B08014_002", geometry = TRUE)

Travis_oneveh <- get_acs(state = "48", county = "453", geography = "tract", 
                        variables = "B08014_003", geometry = TRUE)
Williamson_oneveh <- get_acs(state = "48", county = "491", geography = "tract", 
                            variables = "B08014_003", geometry = TRUE)

Travis_twoveh <- get_acs(state = "48", county = "453", geography = "tract", 
                         variables = "B08014_004", geometry = TRUE)
Williamson_twoveh <- get_acs(state = "48", county = "491", geography = "tract", 
                             variables = "B08014_004", geometry = TRUE)

Travis_threeveh <- get_acs(state = "48", county = "453", geography = "tract", 
                         variables = "B08014_005", geometry = TRUE)
Williamson_threeveh <- get_acs(state = "48", county = "491", geography = "tract", 
                             variables = "B08014_005", geometry = TRUE)

Travis_fourveh <- get_acs(state = "48", county = "453", geography = "tract", 
                           variables = "B08014_006", geometry = TRUE)
Williamson_fourveh <- get_acs(state = "48", county = "491", geography = "tract", 
                               variables = "B08014_006", geometry = TRUE)

Travis_fiveveh <- get_acs(state = "48", county = "453", geography = "tract", 
                          variables = "B08014_007", geometry = TRUE)
Williamson_fiveveh <- get_acs(state = "48", county = "491", geography = "tract", 
                              variables = "B08014_007", geometry = TRUE)

Travis_poverty <- get_acs(state = "48", county = "453", geography = "tract", 
                          variables = "B06012_002", geometry = TRUE)
Williamson_poverty <- get_acs(state = "48", county = "491", geography = "tract", 
                              variables = "B06012_002", geometry = TRUE)

Travis_medInc <- get_acs(state = "48", county = "453", geography = "tract", 
                          variables = "B19013_001", geometry = TRUE)
Williamson_medInc <- get_acs(state = "48", county = "491", geography = "tract", 
                              variables = "B19013_001", geometry = TRUE)
######### buffer demographics #########
#population
Population <- rbind(Travis, Williamson)%>%
  st_transform(2278)
Population_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = Population, sid = GEOID, weight = "sum",
                                  output = "sf", extensive = "estimate")
Population_buff$estimate<- round(Population_buff$estimate)

#race
Race <- rbind(Travis_race, Williamson_race)%>%
  st_transform(2278)
Race_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = Race, sid = GEOID, weight = "sum",
                            output = "sf", extensive = "estimate")
Race_buff$estimate <- round(Race_buff$estimate)

#vehicle ownership
NoVeh <- rbind(Travis_noveh, Williamson_noveh)%>%
  st_transform(2278)
NoVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = NoVeh, sid = GEOID, weight = "sum",
                             output = "sf", extensive = "estimate")
NoVeh_buff$estimate <- round(NoVeh_buff$estimate)


OneVeh <- rbind(Travis_oneveh, Williamson_oneveh)%>%
  st_transform(2278)
OneVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = OneVeh, sid = GEOID, weight = "sum",
                              output = "sf", extensive = "estimate")
OneVeh_buff$estimate <- round(OneVeh_buff$estimate)


TwoVeh <- rbind(Travis_twoveh, Williamson_twoveh)%>%
  st_transform(2278)
TwoVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = TwoVeh, sid = GEOID, weight = "sum",
                              output = "sf", extensive = "estimate")
TwoVeh_buff$estimate <- round(TwoVeh_buff$estimate)


ThreeVeh <- rbind(Travis_threeveh, Williamson_threeveh)%>%
  st_transform(2278)
ThreeVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = ThreeVeh, sid = GEOID, weight = "sum",
                                output = "sf", extensive = "estimate")
ThreeVeh_buff$estimate <- round(ThreeVeh_buff$estimate)


FourVeh <- rbind(Travis_fourveh, Williamson_fourveh)%>%
  st_transform(2278)
FourVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = FourVeh, sid = GEOID, weight = "sum",
                               output = "sf", extensive = "estimate")
FourVeh_buff$estimate <- round(FourVeh_buff$estimate)


FiveVeh <- rbind(Travis_fiveveh, Williamson_fiveveh)%>%
  st_transform(2278)

FiveVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = FiveVeh, sid = GEOID, weight = "sum",
                               output = "sf", extensive = "estimate")
FiveVeh_buff$estimate <- round(FiveVeh_buff$estimate)


#poverty
Poverty <- rbind(Travis_poverty, Williamson_poverty)%>%
  st_transform(2278)
Poverty_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = Poverty, sid = GEOID, weight = "sum",
                               output = "sf", extensive = "estimate")
Poverty_buff$estimate <- round(Poverty_buff$estimate)


#MedInc
medInc <- rbind(Travis_medInc, Williamson_medInc)%>%
  st_transform(2278)
medInc_stop <- st_join(stops, medInc, join = st_intersects)
medInc_stop <- medInc_stop %>%
  st_drop_geometry() %>%
  select(STOP_ID, estimate) %>%
  rename(medInc = estimate)

Route Network

We use the datasets from after Cap Remap for the route network features.
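One way to derive the per-stop route-network shares (akin to the `stop_type_freq` table used in the join below) can be sketched as follows; `stop_routes_demo` is a hypothetical stop-to-route lookup, not the actual CapMetro table:

```r
library(dplyr)
library(tidyr)

# Hypothetical lookup: which routes (and route types) serve each stop
stop_routes_demo <- tibble::tribble(
  ~STOP_ID, ~ROUTE, ~ROUTETYPE,
  1L, 801L, "Local",
  1L, 803L, "Express",
  2L, 801L, "Local",
  2L, 805L, "Local")

# Share of each route type among the routes serving a stop, one row per stop
stop_type_freq_demo <- stop_routes_demo %>%
  count(STOP_ID, ROUTETYPE) %>%
  group_by(STOP_ID) %>%
  mutate(share = n / sum(n)) %>%
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = ROUTETYPE, values_from = share, values_fill = 0)
```

Shares of exactly 0 or 1 mean a stop is served by no, or only, routes of that type; fractional shares are the "others" category created in the recategorization step below.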

Join All Features

all_x1 <- CommercialInit %>%  #amenities and route related
  left_join(st_drop_geometry(RetailInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(OfficeInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(ResidentialInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(SupermktInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(BarInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(UniInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(ParkingInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(SchoolInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(StationInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(StadiumInit), by = "STOP_ID") %>%
  left_join(stop_dir_freq, by = "STOP_ID") %>%
  left_join(stop_type_freq, by = "STOP_ID") %>%
  left_join(stop_hot_freq, by = "STOP_ID") %>%
  left_join(build_dens, by = "STOP_ID") %>%
  left_join(st_drop_geometry(stop_buff_landuse), by = "STOP_ID") %>%
  left_join(st_drop_geometry(Race_buff) %>% rename(race = estimate) %>% select(STOP_ID, race) %>% mutate(STOP_ID = as.numeric(STOP_ID), race = as.numeric(race)), by = "STOP_ID") %>% #census data
  left_join(st_drop_geometry(Population_buff) %>%  rename(population = estimate) %>% select(STOP_ID, population) %>% mutate(STOP_ID = as.numeric(STOP_ID), population = as.numeric(population)), by = "STOP_ID") %>% 
  left_join(st_drop_geometry(NoVeh_buff) %>%  rename(NoVeh = estimate) %>% select(STOP_ID, NoVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), NoVeh = as.numeric(NoVeh)), by = "STOP_ID") %>%
  left_join(st_drop_geometry(OneVeh_buff) %>%  rename(OneVeh = estimate) %>% select(STOP_ID, OneVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), OneVeh = as.numeric(OneVeh)), by = "STOP_ID") %>%
  left_join(st_drop_geometry(TwoVeh_buff) %>%  rename(TwoVeh = estimate) %>% select(STOP_ID, TwoVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), TwoVeh = as.numeric(TwoVeh)), by = "STOP_ID") %>%
  left_join(st_drop_geometry(ThreeVeh_buff) %>%  rename(ThreeVeh = estimate) %>% select(STOP_ID, ThreeVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), ThreeVeh = as.numeric(ThreeVeh)), by = "STOP_ID") %>%
  left_join(st_drop_geometry(FourVeh_buff) %>%  rename(FourVeh = estimate) %>% select(STOP_ID, FourVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), FourVeh = as.numeric(FourVeh)), by = "STOP_ID") %>%
  left_join(st_drop_geometry(FiveVeh_buff) %>%  rename(FiveVeh = estimate) %>% select(STOP_ID, FiveVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), FiveVeh = as.numeric(FiveVeh)), by = "STOP_ID") %>%
  left_join(st_drop_geometry(Poverty_buff) %>%  rename(Poverty = estimate) %>% select(STOP_ID, Poverty) %>% mutate(STOP_ID = as.numeric(STOP_ID), Poverty = as.numeric(Poverty)), by = "STOP_ID") %>%
  left_join(medInc_stop, by= "STOP_ID") %>%
  left_join(st_drop_geometry(stop_nhood), by = "STOP_ID") %>% # fixed effects
  left_join(st_drop_geometry(stop_school), by = "STOP_ID") %>%
  select(-c(hotline_0)) %>%
  left_join(data.2019.mean, by = "STOP_ID")

#spatial lag, knn dist
all_x3 <- bind_cols(list(
  all_x1,
  utaustinDist, CBDDist, commercialDist, retailDist, supermktDist, officeDist,
  residentialDist, schoolDist, universityDist, parkingDist, stadiumDist,
  trainstationDist, airportDist,
  spatial_lag %>% select(spatial_lag2, spatial_lag3, spatial_lag5)))
#recategorize variables
all_x4_normalize <-
  all_x3 %>% 
  mutate(Clockwise_cat = case_when(
      Clockwise == 0 ~ "0",
      Clockwise == 1 ~ "1",
      Clockwise > 0 & Clockwise <1 ~ "others"),
    Counterclockwise_cat = case_when(
      Counterclockwise == 0 ~ "0",
      Counterclockwise == 1 ~ "1",
      Counterclockwise > 0 & Counterclockwise <1 ~ "others"),
    Crosstown_cat = case_when(
      Crosstown == 0 ~ "0",
      Crosstown == 1 ~ "1",
      Crosstown > 0 & Crosstown <1 ~ "others"),
    Express_cat = case_when(
      Express == 0 ~ "0",
      Express == 1 ~ "1",
      Express > 0 & Express <1 ~ "others"),
    Feeder_cat = case_when(
      Feeder == 0 ~ "0",
      Feeder == 1 ~ "1",
      Feeder > 0 & Feeder <1 ~ "others"),
    Flyer_cat = case_when(
      Flyer == 0 ~ "0",
      Flyer == 1 ~ "1",
      Flyer > 0 & Flyer <1 ~ "others"),
    HighFreq_cat = case_when(
      `High Frequency` == 0 ~ "0",
      `High Frequency` == 1 ~ "1",
      `High Frequency` > 0 & `High Frequency` <1 ~ "others"),
    hotline_cat = case_when(
      hotline_1 == 0 ~ "0",
      hotline_1 == 1 ~ "1",
      hotline_1 > 0 & hotline_1 <1 ~ "others"),
    InOut_cat = case_when(
      InOut == 0 ~ "0",
      InOut == 1 ~ "1",
      InOut > 0 & InOut <1 ~ "others"),
    Local_cat = case_when(
      Local == 0 ~ "0",
      Local == 1 ~ "1",
      Local > 0 & Local <1 ~ "others"),
    NightOwl_cat = case_when(
      `Night Owl` == 0 ~ "0",
      `Night Owl` == 1 ~ "1",
      `Night Owl` > 0 & `Night Owl` <1 ~ "others"),
    SN_cat = case_when(
      SouthNorth == 0 ~ "0",
      SouthNorth == 1 ~ "1",
      SouthNorth > 0 & SouthNorth <1 ~ "others"),
    Special_cat = case_when(
      Special == 0 ~ "0",
      Special == 1 ~ "1",
      Special > 0 & Special <1 ~ "others"),
    utshuttle_cat = case_when(
      `UT Shuttle` == 0 ~ "0",
      `UT Shuttle` == 1 ~ "1",
      `UT Shuttle` > 0 & `UT Shuttle` <1 ~ "others"),
    WE_cat = case_when(
      WestEast == 0 ~ "0",
      WestEast == 1 ~ "1",
      WestEast > 0 & WestEast <1 ~ "others"))

Feature Exploration

We change individual features and examine how the predictions respond.

# plot original predictions
lmreg <- lm(mean_on ~ .,
            data = all_x4_normalize %>% st_drop_geometry() %>%
              select(building_area, civic, commercial, residential, industrial,
                     SN_cat, Crosstown_cat, Express_cat, Local_cat, Flyer_cat,
                     NightOwl_cat, HighFreq_cat, InOut_cat, Clockwise_cat,
                     hotline_1, utshuttle_cat, Special_cat, school_count,
                     stadium_count, medInc, nshifts, mean_on))
summary(lmreg)

lm_model0 <-
  all_x4_normalize %>%
  mutate(ridership.Predict = predict(lmreg, all_x4_normalize)) %>%
  mutate(pred_err = ridership.Predict-mean_on,
         pred_err_p = (ridership.Predict-mean_on)/mean_on)

grid.arrange(
ggplot()+
  geom_sf(data = nhood, color = 'grey40',fill = 'grey40') +
  geom_sf(data = st_centroid(na.omit(lm_model0)), aes(color = pred_err),size = 0.9) +
  scale_color_gradientn(colors = c("#b2182b", "#ef8a62", "#fddbc7","#d1d1d1","#67a9cf", "#2166ac"), limits = c(-750,500))+
  labs(title = "Ridership Prediction Error") +
  mapTheme(),

ggplot()+
  geom_sf(data = nhood, color = 'grey40',fill = 'grey40') +
  geom_sf(data = st_centroid(na.omit(lm_model0)), aes(color = pred_err_p),size = 0.9) +
  scale_color_gradientn(colors = c("#b2182b","#ef8a62", "#d1d1d1","#67a9cf","#2166ac"), limits = c(-40,40))+
  labs(title = "Ridership Prediction Error Percentage") +
  mapTheme(),ncol=2)

# aggregate prediction errors by neighborhood
nhood0 <- nhood %>% left_join(lm_model0 %>% 
            na.omit() %>%
            group_by(label) %>% 
            summarise(Pred.err.sum = sum(pred_err), total.ridership = sum(mean_on)) %>% 
            select(label, Pred.err.sum, total.ridership) %>%
            st_drop_geometry(), by = "label")

ggplot()+
  geom_sf(data = nhood0, aes(fill = Pred.err.sum)) +
  labs(title = "Prediction Error by Neighborhood") +
  scale_fill_gradientn(colors = c("#b2182b", "#f4a582", "#f7f7f7", "#2166ac"), limits = c(-5600,3000))+
  mapTheme()

ggplot()+
  geom_sf(data = nhood0, aes(fill = total.ridership)) +
  labs(title = "Ridership by Neighborhood") +
  mapTheme()

Building area SCENARIO 1:

Building area SCENARIO 2:

Building area SCENARIO 3:

Modeling and Validation

Using Neighborhoods for Validation with Four Types of Models

################# Modeling ends here; the visualizations follow #################
#MAPE chart
ggplot(data = val_preds %>% 
         dplyr::select(model, MAPE) %>% 
         distinct() , 
       aes(x = model, y = MAPE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = paste0(round(MAPE,1),"%"))) +
  labs(title = "MAPE of each model on the testing set") +
  theme_bw()
#MAE chart
ggplot(data = val_preds %>% 
           dplyr::select(model, MAE) %>% 
           distinct() , 
         aes(x = model, y = MAE, group = 1)) +
    geom_path(color = "blue") +
    geom_label(aes(label = paste0(round(MAE,1)))) +
    labs(title = "MAE of each model on the testing set") +
    theme_bw()
  
#Predicted vs Observed
ggplot(val_preds, aes(x =.pred, y = mean_on, group = model)) +
  geom_point() +
  geom_abline(linetype = "dashed", color = "red") +
  geom_smooth(method = "lm", color = "blue") +
  coord_equal() +
  facet_wrap(~model, nrow = 2) +
  labs(title = "Predicted vs Observed on the testing set", subtitle = "blue line is predicted value") +
  theme_bw()

#Neighborhood validation
val_MAPE_by_hood <- val_preds %>% 
  group_by(label, model) %>% 
  summarise(RMSE = yardstick::rmse_vec(mean_on, .pred),
            MAE  = yardstick::mae_vec(mean_on, .pred),
            MAPE = yardstick::mape_vec(mean_on, .pred)) %>% 
  ungroup() 
#Barchart of the MAE in each neighborhood
palette4 <- c("#eff3ff", "#bdd7e7","#6baed6","#2171b5")


val_MAPE_by_hood %>%
  dplyr::select(label, model, MAE) %>%
  gather(Variable, MAE, -model, -label) %>%
  ggplot(aes(label, MAE)) + 
  geom_bar(aes(fill = model), position = "dodge", stat="identity") +
  scale_fill_manual(values = palette4) +
  facet_wrap(~label,scales="free", ncol=6)+
  labs(title = "Mean Absolute Errors by model specification and neighborhood") +
  plotTheme()

#Map of MAE in each neighborhood
#Add geometry to the MAE
MAE.nhood <- merge(nhood, val_MAPE_by_hood, by.x="label", by.y="label", all.y=TRUE)

#Produce the map

#Map: MAPE of lm
MAE.nhood%>%
  filter(model=="lm") %>%
  ggplot() +
  #    geom_sf(data = nhoods, fill = "grey40") +
  geom_sf(aes(fill = q5(MAPE))) +
  scale_fill_brewer(palette = "Blues",
                    aesthetics = "fill",
                    labels=qBr(MAE.nhood,"MAPE"),
                    name="Quintile\nBreaks, (%)") +
  labs(title="MAPE of lm in Neighborhoods") +
  mapTheme()

#Map: MAPE of glmnet
MAE.nhood%>%
  filter(model=="glmnet") %>%
  ggplot() +
  #    geom_sf(data = nhoods, fill = "grey40") +
  geom_sf(aes(fill = q5(MAPE))) +
  scale_fill_brewer(palette = "Blues",
                    aesthetics = "fill",
                    labels=qBr(MAE.nhood,"MAPE"),
                    name="Quintile\nBreaks, (%)") +
  labs(title="MAPE of glmnet in Neighborhoods") +
  mapTheme()
#MAPE of rf
MAE.nhood%>%
  filter(model=="rf") %>%
  ggplot() +
  #    geom_sf(data = nhoods, fill = "grey40") +
  geom_sf(aes(fill = q5(MAPE))) +
  scale_fill_brewer(palette = "Blues",
                    aesthetics = "fill",
                    labels=qBr(MAE.nhood,"MAPE"),
                    name="Quintile\nBreaks, (%)") +
  labs(title="MAPE of rf in Neighborhoods") +
  mapTheme()

#MAPE of xgb
MAE.nhood%>%
  filter(model=="xgb") %>%
  ggplot() +
  #    geom_sf(data = nhoods, fill = "grey40") +
  geom_sf(aes(fill = q5(MAPE))) +
  scale_fill_brewer(palette = "Blues",
                    aesthetics = "fill",
                    labels=qBr(MAE.nhood,"MAPE"),
                    name="Quintile\nBreaks, (%)") +
  labs(title="MAPE of xgb in Neighborhoods") +
  mapTheme()

Testing different buffer sizes for model accuracy and generalizability

Buffer size 1/2 mile

#1/2 Buffer Size with Typology Test
data.half <- plyr::join(all_half, typology, type = "left")
data.half$STOP_ID <- NULL
data.half<-data.half %>%
  drop_na()
data.half$universityDist1<-NULL
#Split the data into training and testing sets
data_split.half <- rsample::initial_split(data.half, strata = "mean_on", prop = 0.75)

bus_train.half <- rsample::training(data_split.half)
bus_test.half  <- rsample::testing(data_split.half)
names(bus_train.half)


cv_splits_geo.half <- rsample::group_vfold_cv(bus_train.half,  strata = "mean_on", group = "typology")

#Create recipe
model_rec.half <- recipe(mean_on ~ ., data = bus_train.half) %>% # "." uses every variable in the training set
  update_role(typology, new_role = "typology") %>% # keep typology out of the predictors
  step_other(typology, threshold = 0.005) %>%
  step_dummy(all_nominal(), -typology) %>%
  step_log(mean_on) %>% 
  step_zv(all_predictors()) %>%
  step_center(all_predictors(), -mean_on) %>%
  step_scale(all_predictors(), -mean_on) # put predictors on a standard-deviation scale
  # step_ns(Latitude, Longitude, options = list(df = 4))
model_rec.half

#Build the model
lm_plan <- 
  parsnip::linear_reg() %>% 
  parsnip::set_engine("lm")

glmnet_plan <- 
  parsnip::linear_reg() %>% 
  parsnip::set_args(penalty  = tune()) %>%
  parsnip::set_args(mixture  = tune()) %>%
  parsnip::set_engine("glmnet")

rf_plan <- parsnip::rand_forest() %>%
  parsnip::set_args(mtry  = tune()) %>%
  parsnip::set_args(min_n = tune()) %>%
  parsnip::set_args(trees = 1000) %>% 
  parsnip::set_engine("ranger", importance = "impurity") %>% 
  parsnip::set_mode("regression")

XGB_plan <- parsnip::boost_tree() %>%
  parsnip::set_args(mtry  = tune()) %>%
  parsnip::set_args(min_n = tune()) %>%
  parsnip::set_args(trees = 100) %>% 
  parsnip::set_engine("xgboost") %>% 
  parsnip::set_mode("regression")

#
glmnet_grid <- expand.grid(penalty = seq(0, 1, by = .25), 
                           mixture = seq(0,1,0.25))

rf_grid <- expand.grid(mtry = c(2,5), 
                       min_n = c(1,5))
xgb_grid <- expand.grid(mtry = c(3,5), 
                        min_n = c(1,5))
#Create workflow
lm_wf.half <-
  workflows::workflow() %>% 
  add_recipe(model_rec.half) %>% 
  add_model(lm_plan)

glmnet_wf.half <-
  workflow() %>% 
  add_recipe(model_rec.half) %>% 
  add_model(glmnet_plan)

rf_wf.half <-
  workflow() %>% 
  add_recipe(model_rec.half) %>% 
  add_model(rf_plan)
xgb_wf.half <-
  workflow() %>% 
  add_recipe(model_rec.half) %>% 
  add_model(XGB_plan)
# fit model to workflow and calculate metrics
control <- tune::control_resamples(save_pred = TRUE, verbose = TRUE)
library(tune)
library(yardstick)

lm_tuned.half <- lm_wf.half %>%
  fit_resamples(.,
                resamples = cv_splits_geo.half,
                control   = control,
                metrics   = metric_set(rmse, rsq))
glmnet_tuned.half <- glmnet_wf.half %>%
  tune_grid(.,
            resamples = cv_splits_geo.half,
            grid      = glmnet_grid,
            control   = control,
            metrics   = metric_set(rmse, rsq))

rf_tuned.half <- rf_wf.half %>%
  tune::tune_grid(.,
                  resamples = cv_splits_geo.half,
                  grid      = rf_grid,
                  control   = control,
                  metrics   = metric_set(rmse, rsq))

xgb_tuned.half <- xgb_wf.half %>%
  tune::tune_grid(.,
                  resamples = cv_splits_geo.half,
                  grid      = xgb_grid,
                  control   = control,
                  metrics   = metric_set(rmse, rsq))

show_best(lm_tuned.half, metric = "rmse", n = 15)
show_best(glmnet_tuned.half, metric = "rmse", n = 15)
show_best(rf_tuned.half, metric = "rmse", n = 15)
show_best(xgb_tuned.half, metric = "rmse", n = 15)

lm_best_params.half     <- select_best(lm_tuned.half, metric = "rmse")
glmnet_best_params.half <- select_best(glmnet_tuned.half, metric = "rmse")
rf_best_params.half     <- select_best(rf_tuned.half, metric = "rmse")
xgb_best_params.half    <- select_best(xgb_tuned.half, metric = "rmse")
#Final workflow
lm_best_wf.half     <- finalize_workflow(lm_wf.half, lm_best_params.half)
glmnet_best_wf.half <- finalize_workflow(glmnet_wf.half, glmnet_best_params.half)
rf_best_wf.half     <- finalize_workflow(rf_wf.half, rf_best_params.half)
xgb_best_wf.half    <- finalize_workflow(xgb_wf.half, xgb_best_params.half)

lm_val_fit_geo.half <- lm_best_wf.half %>% 
  last_fit(split     = data_split.half,
           control   = control,
           metrics   = metric_set(rmse, rsq))
glmnet_val_fit_geo.half <- glmnet_best_wf.half %>% 
  last_fit(split     = data_split.half,
           control   = control,
           metrics   = metric_set(rmse, rsq))

rf_val_fit_geo.half <- rf_best_wf.half %>% 
  last_fit(split     = data_split.half,
           control   = control,
           metrics   = metric_set(rmse, rsq))

xgb_val_fit_geo.half <- xgb_best_wf.half %>% 
  last_fit(split     = data_split.half,
           control   = control,
           metrics   = metric_set(rmse, rsq))

####################################Model Validation
# Pull best hyperparam preds from out-of-fold predictions
lm_best_OOF_preds.half <- collect_predictions(lm_tuned.half) 
glmnet_best_OOF_preds.half <- collect_predictions(glmnet_tuned.half) %>% 
  filter(penalty  == glmnet_best_params.half$penalty[1] & mixture == glmnet_best_params.half$mixture[1])
rf_best_OOF_preds.half <- collect_predictions(rf_tuned.half) %>% 
  filter(mtry  == rf_best_params.half$mtry[1] & min_n == rf_best_params.half$min_n[1])

xgb_best_OOF_preds.half <- collect_predictions(xgb_tuned.half) %>% 
  filter(mtry  == xgb_best_params.half$mtry[1] & min_n == xgb_best_params.half$min_n[1])
# collect validation set predictions from last_fit model
lm_val_pred_geo.half     <- collect_predictions(lm_val_fit_geo.half)
glmnet_val_pred_geo.half <- collect_predictions(glmnet_val_fit_geo.half)
rf_val_pred_geo.half     <- collect_predictions(rf_val_fit_geo.half)
xgb_val_pred_geo.half    <- collect_predictions(xgb_val_fit_geo.half)
# Aggregate OOF predictions (they do not overlap with Validation prediction set)
OOF_preds.half <- rbind(data.frame(dplyr::select(lm_best_OOF_preds.half, .pred, mean_on), model = "lm"),
                        data.frame(dplyr::select(glmnet_best_OOF_preds.half, .pred, mean_on), model = "glmnet"),
                        data.frame(dplyr::select(rf_best_OOF_preds.half, .pred, mean_on), model = "RF"),
                        data.frame(dplyr::select(xgb_best_OOF_preds.half, .pred, mean_on), model = "xgb")) %>% 
  group_by(model) %>% 
  mutate(.pred   = exp(.pred),   # back-transform from the log scale used in the recipe
         mean_on = exp(mean_on),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         MAPE = yardstick::mape_vec(mean_on, .pred)) %>%
  ungroup()
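# A compact alternative (a sketch, not required by the analysis): the metric
# columns above repeat the same value within each model, so one row per model
# can be produced with summarise() instead of mutate(). `OOF_metrics.half` is
# an illustrative name not used elsewhere.
OOF_metrics.half <- OOF_preds.half %>% 
  group_by(model) %>% 
  summarise(RMSE = yardstick::rmse_vec(mean_on, .pred),
            MAE  = yardstick::mae_vec(mean_on, .pred),
            MAPE = yardstick::mape_vec(mean_on, .pred)) %>% 
  ungroup()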


# Aggregate predictions from Validation set
library(tidyverse)
val_preds.half <- rbind(data.frame(lm_val_pred_geo.half, model = "lm"),
                   data.frame(glmnet_val_pred_geo.half, model = "glmnet"),
                   data.frame(rf_val_pred_geo.half, model = "rf"),
                   data.frame(xgb_val_pred_geo.half, model = "xgb")) %>% 
  left_join(., data.half %>% 
              rowid_to_column(var = ".row") %>% 
              dplyr::select(typology, .row), 
            by = ".row") %>% 
  dplyr::group_by(model) %>%
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         MAPE = yardstick::mape_vec(mean_on, .pred))%>%
  ungroup()


# One row of validation metrics per model (RMSE, MAE, and MAPE are stored in val_preds.half)
val_preds.half %>% 
  dplyr::select(model, RMSE, MAE, MAPE) %>% 
  distinct()
#Rsquared (manual sanity check, then yardstick)
1 - sum((lm_val_pred_geo.half$mean_on - lm_val_pred_geo.half$.pred) ^ 2)/sum((lm_val_pred_geo.half$mean_on - mean(lm_val_pred_geo.half$mean_on)) ^ 2)
rsq(lm_val_pred_geo.half, mean_on, .pred)
rsq(glmnet_val_pred_geo.half, mean_on, .pred)
rsq(rf_val_pred_geo.half, mean_on, .pred)
rsq(xgb_val_pred_geo.half, mean_on, .pred)
#MAE and MAPE (mean and spread of the absolute and percentage errors)
mean(abs(lm_val_pred_geo.half$mean_on - lm_val_pred_geo.half$.pred))
sd(abs(lm_val_pred_geo.half$mean_on - lm_val_pred_geo.half$.pred))
mean(abs((lm_val_pred_geo.half$mean_on - lm_val_pred_geo.half$.pred)/lm_val_pred_geo.half$mean_on))
sd(abs((lm_val_pred_geo.half$mean_on - lm_val_pred_geo.half$.pred)/lm_val_pred_geo.half$mean_on))

mean(abs(glmnet_val_pred_geo.half$mean_on - glmnet_val_pred_geo.half$.pred))
sd(abs(glmnet_val_pred_geo.half$mean_on - glmnet_val_pred_geo.half$.pred))
mean(abs((glmnet_val_pred_geo.half$mean_on - glmnet_val_pred_geo.half$.pred)/glmnet_val_pred_geo.half$mean_on))
sd(abs((glmnet_val_pred_geo.half$mean_on - glmnet_val_pred_geo.half$.pred)/glmnet_val_pred_geo.half$mean_on))

mean(abs(rf_val_pred_geo.half$mean_on - rf_val_pred_geo.half$.pred))
sd(abs(rf_val_pred_geo.half$mean_on - rf_val_pred_geo.half$.pred))
mean(abs((rf_val_pred_geo.half$mean_on - rf_val_pred_geo.half$.pred)/rf_val_pred_geo.half$mean_on))
sd(abs((rf_val_pred_geo.half$mean_on - rf_val_pred_geo.half$.pred)/rf_val_pred_geo.half$mean_on))

mean(abs(xgb_val_pred_geo.half$mean_on - xgb_val_pred_geo.half$.pred))
sd(abs(xgb_val_pred_geo.half$mean_on - xgb_val_pred_geo.half$.pred))
mean(abs((xgb_val_pred_geo.half$mean_on - xgb_val_pred_geo.half$.pred)/xgb_val_pred_geo.half$mean_on))
sd(abs((xgb_val_pred_geo.half$mean_on - xgb_val_pred_geo.half$.pred)/xgb_val_pred_geo.half$mean_on))
#RMSE (plus a rough dispersion check on the squared errors)
sqrt(mean((lm_val_pred_geo.half$mean_on - lm_val_pred_geo.half$.pred)^2))
sqrt(mean((glmnet_val_pred_geo.half$mean_on - glmnet_val_pred_geo.half$.pred)^2))
sqrt(mean((rf_val_pred_geo.half$mean_on - rf_val_pred_geo.half$.pred)^2))
sqrt(mean((xgb_val_pred_geo.half$mean_on - xgb_val_pred_geo.half$.pred)^2))
sqrt(sd((lm_val_pred_geo.half$mean_on - lm_val_pred_geo.half$.pred)^2))
sqrt(sd((glmnet_val_pred_geo.half$mean_on - glmnet_val_pred_geo.half$.pred)^2))
sqrt(sd((rf_val_pred_geo.half$mean_on - rf_val_pred_geo.half$.pred)^2))
sqrt(sd((xgb_val_pred_geo.half$mean_on - xgb_val_pred_geo.half$.pred)^2))

yardstick::rmse_vec(lm_val_pred_geo.half$mean_on, lm_val_pred_geo.half$.pred)
yardstick::mape_vec(lm_val_pred_geo.half$mean_on, lm_val_pred_geo.half$.pred)
###################################Modelling part ends here, below are the visualizations##############
#MAPE chart
ggplot(data = val_preds.half %>% 
         dplyr::select(model, MAPE) %>% 
         distinct(), 
       aes(x = model, y = MAPE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = paste0(round(MAPE, 1), "%"))) +
  labs(title = "1/2-mi Buffer, MAPE of each model on the testing set with typology") +
  theme_bw()
#MAE chart
ggplot(data = val_preds.half %>% 
         dplyr::select(model, MAE) %>% 
         distinct(), 
       aes(x = model, y = MAE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = round(MAE, 1))) +
  labs(title = "1/2-mi Buffer, MAE of each model on the testing set with typology") +
  theme_bw()
#RMSE
ggplot(data = val_preds.half %>% 
         dplyr::select(model, RMSE) %>% 
         distinct(), 
       aes(x = model, y = RMSE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = round(RMSE, 1))) +
  labs(title = "1/2-mi Buffer, RMSE of each model on the testing set with typology") +
  theme_bw()
#Predicted vs Observed
ggplot(val_preds.half, aes(x = .pred, y = mean_on, group = model)) +
  geom_point() +
  geom_abline(linetype = "dashed", color = "red") +
  geom_smooth(method = "lm", color = "blue") +
  coord_equal() +
  facet_wrap(~model, nrow = 2) +
  labs(title = "1/2 Mile: Predicted vs Observed on the testing set",
       subtitle = "blue line is the best-fit line; dashed red line is perfect prediction") +
  theme_bw()

#Neighborhood validation
val_MAPE_by_typology.half <- val_preds.half %>% 
  group_by(typology, model) %>% 
  summarise(RMSE = yardstick::rmse_vec(mean_on, .pred),
            MAE  = yardstick::mae_vec(mean_on, .pred),
            MAPE = yardstick::mape_vec(mean_on, .pred)) %>% 
  ungroup() 
#Barchart of the MAE in each neighborhood
plotTheme <- function(base_size = 10) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 20,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=20),
    axis.title = element_text(size=20),
    axis.text = element_text(size=15),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic", size= 20),
    legend.text = element_text(colour = "black", face = "italic",size = 20),
    strip.text.x = element_text(size = 15)
  )
}
palette4 <- c("#eff3ff", "#bdd7e7","#6baed6","#2171b5")


as.data.frame(val_MAPE_by_typology.half) %>%
  dplyr::select(typology, model, MAE) %>%
  ggplot(aes(typology, MAE)) + 
  geom_bar(aes(fill = model), position = "dodge", stat = "identity") +
  ylim(0, 300) +
  scale_fill_manual(values = palette4) +
  facet_wrap(~typology, scales = "free", ncol = 4) +
  labs(title = "1/2 mile: Mean Absolute Errors by model specification") +
  plotTheme()
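# The 1/2-, 1/4-, and 1/8-mile pipelines repeat the same workflow/tune/fit
# steps. A sketch of how the shared steps could be wrapped in one helper to
# reduce the duplication (`run_buffer_models` and its arguments are
# illustrative names, not part of the analysis above):
run_buffer_models <- function(recipe_spec, model_spec, cv_splits, grid = NULL) {
  wf <- workflows::workflow() %>% 
    workflows::add_recipe(recipe_spec) %>% 
    workflows::add_model(model_spec)
  if (is.null(grid)) {
    # models with no hyperparameters to tune (e.g. plain lm)
    tune::fit_resamples(wf, resamples = cv_splits, control = control,
                        metrics = yardstick::metric_set(rmse, rsq))
  } else {
    tune::tune_grid(wf, resamples = cv_splits, grid = grid, control = control,
                    metrics = yardstick::metric_set(rmse, rsq))
  }
}
# e.g., rf_tuned.half could then be written as:
# rf_tuned.half <- run_buffer_models(model_rec.half, rf_plan, cv_splits_geo.half, rf_grid)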

Buffer size: 1/4 mile

#1/4 mile buffer with ridership: data.quarter
#typology
sum(is.na(All_3))
sum(is.na(all_2))
All_3$typology
summary(all_2$parkingDist)
typology<- All_3 %>%
  dplyr::select(STOP_ID, typology)
typology$typology <- ifelse(typology$typology == "CBD" , 'CBD',
                                ifelse(typology$typology == "UT", 'UT',
                                       ifelse(typology$typology == "UT&CBD", 'CBD', 'Rest')))

#write.csv(typology, "C:/Upenn/Practicum/Data/Typology_withSTOP_ID.csv")
names(all_2)
data.quarter <- plyr::join(all_2, typology, type ="left")
data.quarter$STOP_ID <- NULL

data.quarter<-data.quarter %>%
  drop_na()
data.quarter$universityDist1<-NULL
#Split the data into training and testing sets
data_split.quarter <- rsample::initial_split(data.quarter, strata = "mean_on", prop = 0.75)

bus_train.quarter <- rsample::training(data_split.quarter)
bus_test.quarter  <- rsample::testing(data_split.quarter)
names(bus_train.quarter)


cv_splits_geo.quarter <- rsample::group_vfold_cv(bus_train.quarter,  strata = "mean_on", group = "typology")
print(cv_splits_geo.quarter)

#Create recipe
model_rec.quarter <- recipe(mean_on ~ ., data = bus_train.quarter) %>% # "." = all other variables as predictors
  update_role(typology, new_role = "typology") %>% # keep typology out of the predictors but available for grouping
  step_other(typology, threshold = 0.005) %>%
  step_dummy(all_nominal(), -typology) %>%
  step_log(mean_on) %>% 
  step_zv(all_predictors()) %>%
  step_center(all_predictors(), -mean_on) %>%
  step_scale(all_predictors(), -mean_on) # put predictors on a standard-deviation scale
model_rec.quarter
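# Optional check (a sketch, not required by the analysis): prep() and bake()
# the recipe to inspect the processed predictors before modeling.
# `prepped.quarter` is an illustrative name not used elsewhere.
prepped.quarter <- recipes::prep(model_rec.quarter, training = bus_train.quarter)
head(recipes::bake(prepped.quarter, new_data = NULL))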

#Build the model
# Model specifications (lm_plan, glmnet_plan, rf_plan, XGB_plan) and tuning
# grids (glmnet_grid, rf_grid, xgb_grid) are identical to those in the
# 1/2-mile section and are reused here.
#Create workflow
lm_wf.quarter <-
  workflows::workflow() %>% 
  add_recipe(model_rec.quarter) %>% 
  add_model(lm_plan)

glmnet_wf.quarter <-
  workflow() %>% 
  add_recipe(model_rec.quarter) %>% 
  add_model(glmnet_plan)

rf_wf.quarter <-
  workflow() %>% 
  add_recipe(model_rec.quarter) %>% 
  add_model(rf_plan)
xgb_wf.quarter <-
  workflow() %>% 
  add_recipe(model_rec.quarter) %>% 
  add_model(XGB_plan)
# fit model to workflow and calculate metrics
control <- tune::control_resamples(save_pred = TRUE, verbose = TRUE)

lm_tuned.quarter <- lm_wf.quarter %>%
  fit_resamples(.,
                resamples = cv_splits_geo.quarter,
                control   = control,
                metrics   = metric_set(rmse, rsq))
glmnet_tuned.quarter <- glmnet_wf.quarter %>%
  tune_grid(.,
            resamples = cv_splits_geo.quarter,
            grid      = glmnet_grid,
            control   = control,
            metrics   = metric_set(rmse, rsq))

rf_tuned.quarter <- rf_wf.quarter %>%
  tune::tune_grid(.,
                  resamples = cv_splits_geo.quarter,
                  grid      = rf_grid,
                  control   = control,
                  metrics   = metric_set(rmse, rsq))

xgb_tuned.quarter <- xgb_wf.quarter %>%
  tune::tune_grid(.,
                  resamples = cv_splits_geo.quarter,
                  grid      = xgb_grid,
                  control   = control,
                  metrics   = metric_set(rmse, rsq))

show_best(lm_tuned.quarter, metric = "rmse", n = 15)
show_best(glmnet_tuned.quarter, metric = "rmse", n = 15)
show_best(rf_tuned.quarter, metric = "rmse", n = 15)
show_best(xgb_tuned.quarter, metric = "rmse", n = 15)

lm_best_params.quarter     <- select_best(lm_tuned.quarter, metric = "rmse")
glmnet_best_params.quarter <- select_best(glmnet_tuned.quarter, metric = "rmse")
rf_best_params.quarter     <- select_best(rf_tuned.quarter, metric = "rmse")
xgb_best_params.quarter    <- select_best(xgb_tuned.quarter, metric = "rmse")
#Final workflow
lm_best_wf.quarter     <- finalize_workflow(lm_wf.quarter, lm_best_params.quarter)
glmnet_best_wf.quarter <- finalize_workflow(glmnet_wf.quarter, glmnet_best_params.quarter)
rf_best_wf.quarter     <- finalize_workflow(rf_wf.quarter, rf_best_params.quarter)
xgb_best_wf.quarter    <- finalize_workflow(xgb_wf.quarter, xgb_best_params.quarter)

lm_val_fit_geo.quarter <- lm_best_wf.quarter %>% 
  last_fit(split     = data_split.quarter,
           control   = control,
           metrics   = metric_set(rmse, rsq))
glmnet_val_fit_geo.quarter <- glmnet_best_wf.quarter %>% 
  last_fit(split     = data_split.quarter,
           control   = control,
           metrics   = metric_set(rmse, rsq))

rf_val_fit_geo.quarter <- rf_best_wf.quarter %>% 
  last_fit(split     = data_split.quarter,
           control   = control,
           metrics   = metric_set(rmse, rsq))

xgb_val_fit_geo.quarter <- xgb_best_wf.quarter %>% 
  last_fit(split     = data_split.quarter,
           control   = control,
           metrics   = metric_set(rmse, rsq))

####################################Model Validation
# Pull best hyperparam preds from out-of-fold predictions
lm_best_OOF_preds.quarter <- collect_predictions(lm_tuned.quarter) 
glmnet_best_OOF_preds.quarter <- collect_predictions(glmnet_tuned.quarter) %>% 
  filter(penalty  == glmnet_best_params.quarter$penalty[1] & mixture == glmnet_best_params.quarter$mixture[1])
rf_best_OOF_preds.quarter <- collect_predictions(rf_tuned.quarter) %>% 
  filter(mtry  == rf_best_params.quarter$mtry[1] & min_n == rf_best_params.quarter$min_n[1])

xgb_best_OOF_preds.quarter <- collect_predictions(xgb_tuned.quarter) %>% 
  filter(mtry  == xgb_best_params.quarter$mtry[1] & min_n == xgb_best_params.quarter$min_n[1])
# collect validation set predictions from last_fit model
lm_val_pred_geo.quarter     <- collect_predictions(lm_val_fit_geo.quarter)
glmnet_val_pred_geo.quarter <- collect_predictions(glmnet_val_fit_geo.quarter)
rf_val_pred_geo.quarter     <- collect_predictions(rf_val_fit_geo.quarter)
xgb_val_pred_geo.quarter    <- collect_predictions(xgb_val_fit_geo.quarter)
# Aggregate OOF predictions (they do not overlap with Validation prediction set)

OOF_preds.quarter <- rbind(data.frame(dplyr::select(lm_best_OOF_preds.quarter, .pred, mean_on), model = "lm"),
                           data.frame(dplyr::select(glmnet_best_OOF_preds.quarter, .pred, mean_on), model = "glmnet"),
                           data.frame(dplyr::select(rf_best_OOF_preds.quarter, .pred, mean_on), model = "RF"),
                           data.frame(dplyr::select(xgb_best_OOF_preds.quarter, .pred, mean_on), model = "xgb")) %>% 
  group_by(model) %>% 
  mutate(.pred   = exp(.pred),   # back-transform from the log scale used in the recipe
         mean_on = exp(mean_on),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         MAPE = yardstick::mape_vec(mean_on, .pred)) %>%
  ungroup()


# Aggregate predictions from Validation set
val_preds.quarter <- rbind(data.frame(lm_val_pred_geo.quarter, model = "lm"),
                   data.frame(glmnet_val_pred_geo.quarter, model = "glmnet"),
                   data.frame(rf_val_pred_geo.quarter, model = "rf"),
                   data.frame(xgb_val_pred_geo.quarter, model = "xgb")) %>% 
  left_join(., data.quarter %>% 
              rowid_to_column(var = ".row") %>% 
              dplyr::select(typology, .row), 
            by = ".row") %>% 
  dplyr::group_by(model) %>%
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         MAPE = yardstick::mape_vec(mean_on, .pred))%>%
  ungroup()


# One row of validation metrics per model (RMSE, MAE, and MAPE are stored in val_preds.quarter)
val_preds.quarter %>% 
  dplyr::select(model, RMSE, MAE, MAPE) %>% 
  distinct()
#Rsquared (manual sanity check, then yardstick)
1 - sum((lm_val_pred_geo.quarter$mean_on - lm_val_pred_geo.quarter$.pred) ^ 2)/sum((lm_val_pred_geo.quarter$mean_on - mean(lm_val_pred_geo.quarter$mean_on)) ^ 2)
rsq(lm_val_pred_geo.quarter, mean_on, .pred)
rsq(glmnet_val_pred_geo.quarter, mean_on, .pred)
rsq(rf_val_pred_geo.quarter, mean_on, .pred)
rsq(xgb_val_pred_geo.quarter, mean_on, .pred)
#MAE and MAPE (mean and spread of the absolute and percentage errors)
mean(abs(lm_val_pred_geo.quarter$mean_on - lm_val_pred_geo.quarter$.pred))
sd(abs(lm_val_pred_geo.quarter$mean_on - lm_val_pred_geo.quarter$.pred))
mean(abs((lm_val_pred_geo.quarter$mean_on - lm_val_pred_geo.quarter$.pred)/lm_val_pred_geo.quarter$mean_on))
sd(abs((lm_val_pred_geo.quarter$mean_on - lm_val_pred_geo.quarter$.pred)/lm_val_pred_geo.quarter$mean_on))

mean(abs(glmnet_val_pred_geo.quarter$mean_on - glmnet_val_pred_geo.quarter$.pred))
sd(abs(glmnet_val_pred_geo.quarter$mean_on - glmnet_val_pred_geo.quarter$.pred))
mean(abs((glmnet_val_pred_geo.quarter$mean_on - glmnet_val_pred_geo.quarter$.pred)/glmnet_val_pred_geo.quarter$mean_on))
sd(abs((glmnet_val_pred_geo.quarter$mean_on - glmnet_val_pred_geo.quarter$.pred)/glmnet_val_pred_geo.quarter$mean_on))

mean(abs(rf_val_pred_geo.quarter$mean_on - rf_val_pred_geo.quarter$.pred))
sd(abs(rf_val_pred_geo.quarter$mean_on - rf_val_pred_geo.quarter$.pred))
mean(abs((rf_val_pred_geo.quarter$mean_on - rf_val_pred_geo.quarter$.pred)/rf_val_pred_geo.quarter$mean_on))
sd(abs((rf_val_pred_geo.quarter$mean_on - rf_val_pred_geo.quarter$.pred)/rf_val_pred_geo.quarter$mean_on))

mean(abs(xgb_val_pred_geo.quarter$mean_on - xgb_val_pred_geo.quarter$.pred))
sd(abs(xgb_val_pred_geo.quarter$mean_on - xgb_val_pred_geo.quarter$.pred))
mean(abs((xgb_val_pred_geo.quarter$mean_on - xgb_val_pred_geo.quarter$.pred)/xgb_val_pred_geo.quarter$mean_on))
sd(abs((xgb_val_pred_geo.quarter$mean_on - xgb_val_pred_geo.quarter$.pred)/xgb_val_pred_geo.quarter$mean_on))
#RMSE (plus a rough dispersion check on the squared errors)
sqrt(mean((lm_val_pred_geo.quarter$mean_on - lm_val_pred_geo.quarter$.pred)^2))
sqrt(mean((glmnet_val_pred_geo.quarter$mean_on - glmnet_val_pred_geo.quarter$.pred)^2))
sqrt(mean((rf_val_pred_geo.quarter$mean_on - rf_val_pred_geo.quarter$.pred)^2))
sqrt(mean((xgb_val_pred_geo.quarter$mean_on - xgb_val_pred_geo.quarter$.pred)^2))
sqrt(sd((lm_val_pred_geo.quarter$mean_on - lm_val_pred_geo.quarter$.pred)^2))
sqrt(sd((glmnet_val_pred_geo.quarter$mean_on - glmnet_val_pred_geo.quarter$.pred)^2))
sqrt(sd((rf_val_pred_geo.quarter$mean_on - rf_val_pred_geo.quarter$.pred)^2))
sqrt(sd((xgb_val_pred_geo.quarter$mean_on - xgb_val_pred_geo.quarter$.pred)^2))

yardstick::rmse_vec(lm_val_pred_geo.quarter$mean_on, lm_val_pred_geo.quarter$.pred)
yardstick::mape_vec(lm_val_pred_geo.quarter$mean_on, lm_val_pred_geo.quarter$.pred)
###################################Modelling part ends here, below are the visualizations##############
#MAPE chart
ggplot(data = val_preds.quarter %>% 
         dplyr::select(model, MAPE) %>% 
         distinct(), 
       aes(x = model, y = MAPE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = paste0(round(MAPE, 1), "%"))) +
  labs(title = "1/4-mi Buffer, MAPE of each model on the testing set with typology") +
  theme_bw()
#MAE chart
ggplot(data = val_preds.quarter %>% 
         dplyr::select(model, MAE) %>% 
         distinct(), 
       aes(x = model, y = MAE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = round(MAE, 1))) +
  labs(title = "1/4-mi Buffer, MAE of each model on the testing set with typology") +
  theme_bw()
#RMSE
ggplot(data = val_preds.quarter %>% 
         dplyr::select(model, RMSE) %>% 
         distinct(), 
       aes(x = model, y = RMSE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = round(RMSE, 1))) +
  labs(title = "1/4-mi Buffer, RMSE of each model on the testing set with typology") +
  theme_bw()
#Predicted vs Observed
ggplot(val_preds.quarter, aes(x = .pred, y = mean_on, group = model)) +
  geom_point() +
  geom_abline(linetype = "dashed", color = "red") +
  geom_smooth(method = "lm", color = "blue") +
  coord_equal() +
  facet_wrap(~model, nrow = 2) +
  labs(title = "1/4 Mile: Predicted vs Observed on the testing set",
       subtitle = "blue line is the best-fit line; dashed red line is perfect prediction") +
  theme_bw()

#Neighborhood validation
val_MAPE_by_typology.quarter <- val_preds.quarter %>% 
  group_by(typology, model) %>% 
  summarise(RMSE = yardstick::rmse_vec(mean_on, .pred),
            MAE  = yardstick::mae_vec(mean_on, .pred),
            MAPE = yardstick::mape_vec(mean_on, .pred)) %>% 
  ungroup() 
#Barchart of the MAE in each neighborhood
# plotTheme() and palette4 are defined in the 1/2-mile section and reused here.


as.data.frame(val_MAPE_by_typology.quarter) %>%
  dplyr::select(typology, model, MAE) %>%
  ggplot(aes(typology, MAE)) + 
  geom_bar(aes(fill = model), position = "dodge", stat = "identity") +
  ylim(0, 300) +
  scale_fill_manual(values = palette4) +
  facet_wrap(~typology, scales = "free", ncol = 4) +
  labs(title = "1/4 mile: Mean Absolute Errors by model specification") +
  plotTheme()

Buffer size: 1/8 mile

#1/8-mile buffer
data.eighth <- plyr::join(all_eighth, typology, type = "left")
data.eighth$STOP_ID <- NULL

data.eighth<-data.eighth %>%
  drop_na()
data.eighth$universityDist1<-NULL
#Split the data into training and testing sets
data_split.eighth <- rsample::initial_split(data.eighth, strata = "mean_on", prop = 0.75)

bus_train.eighth <- rsample::training(data_split.eighth)
bus_test.eighth  <- rsample::testing(data_split.eighth)
names(bus_train.eighth)


cv_splits_geo.eighth <- rsample::group_vfold_cv(bus_train.eighth,  strata = "mean_on", group = "typology")
print(cv_splits_geo.eighth)

#Create recipe
model_rec.eighth <- recipe(mean_on ~ ., data = bus_train.eighth) %>% # "." = all other variables as predictors
  update_role(typology, new_role = "typology") %>% # keep typology out of the predictors but available for grouping
  step_other(typology, threshold = 0.005) %>%
  step_dummy(all_nominal(), -typology) %>%
  step_log(mean_on) %>% 
  step_zv(all_predictors()) %>%
  step_center(all_predictors(), -mean_on) %>%
  step_scale(all_predictors(), -mean_on) # put predictors on a standard-deviation scale
model_rec.eighth

#Build the model
# Model specifications (lm_plan, glmnet_plan, rf_plan, XGB_plan) and tuning
# grids (glmnet_grid, rf_grid, xgb_grid) are identical to those in the
# 1/2-mile section and are reused here.
#Create workflow
lm_wf.eighth <-
  workflows::workflow() %>% 
  add_recipe(model_rec.eighth) %>% 
  add_model(lm_plan)

glmnet_wf.eighth <-
  workflow() %>% 
  add_recipe(model_rec.eighth) %>% 
  add_model(glmnet_plan)

rf_wf.eighth <-
  workflow() %>% 
  add_recipe(model_rec.eighth) %>% 
  add_model(rf_plan)
xgb_wf.eighth <-
  workflow() %>% 
  add_recipe(model_rec.eighth) %>% 
  add_model(XGB_plan)
# fit model to workflow and calculate metrics
control <- tune::control_resamples(save_pred = TRUE, verbose = TRUE)

lm_tuned.eighth <- lm_wf.eighth %>%
  fit_resamples(.,
                resamples = cv_splits_geo.eighth,
                control   = control,
                metrics   = metric_set(rmse, rsq))
glmnet_tuned.eighth <- glmnet_wf.eighth %>%
  tune_grid(.,
            resamples = cv_splits_geo.eighth,
            grid      = glmnet_grid,
            control   = control,
            metrics   = metric_set(rmse, rsq))

rf_tuned.eighth <- rf_wf.eighth %>%
  tune::tune_grid(.,
                  resamples = cv_splits_geo.eighth,
                  grid      = rf_grid,
                  control   = control,
                  metrics   = metric_set(rmse, rsq))

xgb_tuned.eighth <- xgb_wf.eighth %>%
  tune::tune_grid(.,
                  resamples = cv_splits_geo.eighth,
                  grid      = xgb_grid,
                  control   = control,
                  metrics   = metric_set(rmse, rsq))

show_best(lm_tuned.eighth, metric = "rmse", n = 15)
show_best(glmnet_tuned.eighth, metric = "rmse", n = 15)
show_best(rf_tuned.eighth, metric = "rmse", n = 15)
show_best(xgb_tuned.eighth, metric = "rmse", n = 15)

lm_best_params.eighth     <- select_best(lm_tuned.eighth, metric = "rmse")
glmnet_best_params.eighth <- select_best(glmnet_tuned.eighth, metric = "rmse")
rf_best_params.eighth     <- select_best(rf_tuned.eighth, metric = "rmse")
xgb_best_params.eighth    <- select_best(xgb_tuned.eighth, metric = "rmse")
#Final workflow
lm_best_wf.eighth     <- finalize_workflow(lm_wf.eighth, lm_best_params.eighth)
glmnet_best_wf.eighth <- finalize_workflow(glmnet_wf.eighth, glmnet_best_params.eighth)
rf_best_wf.eighth     <- finalize_workflow(rf_wf.eighth, rf_best_params.eighth)
xgb_best_wf.eighth    <- finalize_workflow(xgb_wf.eighth, xgb_best_params.eighth)

lm_val_fit_geo.eighth <- lm_best_wf.eighth %>% 
  last_fit(split     = data_split.eighth,
           control   = control,
           metrics   = metric_set(rmse, rsq))
glmnet_val_fit_geo.eighth <- glmnet_best_wf.eighth %>% 
  last_fit(split     = data_split.eighth,
           control   = control,
           metrics   = metric_set(rmse, rsq))

rf_val_fit_geo.eighth <- rf_best_wf.eighth %>% 
  last_fit(split     = data_split.eighth,
           control   = control,
           metrics   = metric_set(rmse, rsq))

xgb_val_fit_geo.eighth <- xgb_best_wf.eighth %>% 
  last_fit(split     = data_split.eighth,
           control   = control,
           metrics   = metric_set(rmse, rsq))

####################################Model Validation
# Pull best hyperparam preds from out-of-fold predictions
lm_best_OOF_preds.eighth <- collect_predictions(lm_tuned.eighth) 
glmnet_best_OOF_preds.eighth <- collect_predictions(glmnet_tuned.eighth) %>% 
  filter(penalty  == glmnet_best_params.eighth$penalty[1] & mixture == glmnet_best_params.eighth$mixture[1])
rf_best_OOF_preds.eighth <- collect_predictions(rf_tuned.eighth) %>% 
  filter(mtry  == rf_best_params.eighth$mtry[1] & min_n == rf_best_params.eighth$min_n[1])

xgb_best_OOF_preds.eighth <- collect_predictions(xgb_tuned.eighth) %>% 
  filter(mtry  == xgb_best_params.eighth$mtry[1] & min_n == xgb_best_params.eighth$min_n[1])
# collect validation set predictions from last_fit model
lm_val_pred_geo.eighth     <- collect_predictions(lm_val_fit_geo.eighth)
glmnet_val_pred_geo.eighth <- collect_predictions(glmnet_val_fit_geo.eighth)
rf_val_pred_geo.eighth     <- collect_predictions(rf_val_fit_geo.eighth)
xgb_val_pred_geo.eighth    <- collect_predictions(xgb_val_fit_geo.eighth)
# Aggregate OOF predictions (they do not overlap with Validation prediction set)
OOF_preds.eighth <- rbind(data.frame(dplyr::select(lm_best_OOF_preds.eighth, .pred, mean_on), model = "lm"),
                           data.frame(dplyr::select(glmnet_best_OOF_preds.eighth, .pred, mean_on), model = "glmnet"),
                           data.frame(dplyr::select(rf_best_OOF_preds.eighth, .pred, mean_on), model = "RF"),
                           data.frame(dplyr::select(xgb_best_OOF_preds.eighth, .pred, mean_on), model = "xgb")) %>% 
  group_by(model) %>% 
  mutate(.pred   = exp(.pred),   # back-transform from the log scale
         mean_on = exp(mean_on),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         MAPE = yardstick::mape_vec(mean_on, .pred)) %>%
  ungroup()
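Because mean_on is modeled on the log scale (via step_log in the recipe), error metrics are only meaningful after back-transforming the predictions. A minimal base-R sketch with hypothetical ridership numbers:

```r
# Hypothetical daily boardings, modeled on the log scale
log_truth <- log(c(100, 250, 40))
log_pred  <- log(c(120, 200, 50))

# Back-transform before computing metrics on the original ridership scale
truth <- exp(log_truth)
pred  <- exp(log_pred)

rmse <- sqrt(mean((truth - pred)^2))
mape <- mean(abs((truth - pred) / truth)) * 100
round(c(RMSE = rmse, MAPE = mape), 1)  # RMSE 31.6, MAPE 21.7
```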


# Aggregate predictions from the validation set
library(tidyverse)
library(yardstick)
# plyr masks dplyr verbs used below, so detach it if it is attached
if ("package:plyr" %in% search()) detach("package:plyr")
val_preds.eighth <- rbind(data.frame(lm_val_pred_geo.eighth, model = "lm"),
                   data.frame(glmnet_val_pred_geo.eighth, model = "glmnet"),
                   data.frame(rf_val_pred_geo.eighth, model = "rf"),
                   data.frame(xgb_val_pred_geo.eighth, model = "xgb")) %>% 
  left_join(., data.eighth %>% 
              rowid_to_column(var = ".row") %>% 
              dplyr::select(typology, .row), 
            by = ".row") %>% 
  group_by(model) %>%
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         MAPE = yardstick::mape_vec(mean_on, .pred))%>%
  ungroup()


# Join typology onto the lm predictions and back-transform once (exp() here only)
lm_val_pred_geo.eighth <- lm_val_pred_geo.eighth %>%
  left_join(., data.eighth %>% 
              rowid_to_column(var = ".row") %>% 
              dplyr::select(typology, .row), 
            by = ".row") %>%
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on))

# Per-typology errors; the predictions are already back-transformed above,
# so exp() must not be applied a second time
lm_val_pred_geo.eighth <- lm_val_pred_geo.eighth %>%
  group_by(typology) %>%
  mutate(Error = abs(mean_on - .pred),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         MAPE = yardstick::mape_vec(mean_on, .pred)) %>%
  ungroup()

glmnet_val_pred_geo.eighth<- glmnet_val_pred_geo.eighth%>%
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         MAPE = yardstick::mape_vec(mean_on, .pred))

rf_val_pred_geo.eighth<- rf_val_pred_geo.eighth%>%
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         MAPE = yardstick::mape_vec(mean_on, .pred))

xgb_val_pred_geo.eighth<- xgb_val_pred_geo.eighth%>%
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         MAPE = yardstick::mape_vec(mean_on, .pred))

# Rebuild val_preds.eighth from the back-transformed frames; keep only the
# shared columns (the lm frame carries extra columns from the typology join)
# so the rbind succeeds, then recompute per-model metrics
val_preds.eighth <- rbind(data.frame(dplyr::select(lm_val_pred_geo.eighth, .row, .pred, mean_on), model = "lm"),
                          data.frame(dplyr::select(glmnet_val_pred_geo.eighth, .row, .pred, mean_on), model = "glmnet"),
                          data.frame(dplyr::select(rf_val_pred_geo.eighth, .row, .pred, mean_on), model = "rf"),
                          data.frame(dplyr::select(xgb_val_pred_geo.eighth, .row, .pred, mean_on), model = "xgb")) %>%
  left_join(., data.eighth %>% 
              rowid_to_column(var = ".row") %>% 
              dplyr::select(typology, .row), 
            by = ".row") %>%
  group_by(model) %>%
  mutate(RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         MAPE = yardstick::mape_vec(mean_on, .pred)) %>%
  ungroup()
summary(lm_val_pred_geo.eighth$MAE)
summary(glmnet_val_pred_geo.eighth$MAE)
summary(rf_val_pred_geo.eighth$MAE)
summary(xgb_val_pred_geo.eighth$MAE)
summary(lm_val_pred_geo.eighth$MAPE)
summary(glmnet_val_pred_geo.eighth$MAPE)
summary(rf_val_pred_geo.eighth$MAPE)
summary(xgb_val_pred_geo.eighth$MAPE)
summary(lm_val_pred_geo.eighth$RMSE)
summary(glmnet_val_pred_geo.eighth$RMSE)
summary(rf_val_pred_geo.eighth$RMSE)
summary(xgb_val_pred_geo.eighth$RMSE)

#R-squared on the validation set (original ridership scale)
rsq(lm_val_pred_geo.eighth, mean_on, .pred)
rsq(glmnet_val_pred_geo.eighth, mean_on, .pred)
rsq(rf_val_pred_geo.eighth, mean_on, .pred)
rsq(xgb_val_pred_geo.eighth, mean_on, .pred)

###################################Modelling part ends here, below are the visualizations##############
#MAPE chart
ggplot(data = val_preds.eighth %>% 
         dplyr::select(model, MAPE) %>% 
         distinct(), 
       aes(x = model, y = MAPE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = paste0(round(MAPE, 1), "%"))) +
  labs(title = "1/8-mi Buffer, MAPE of each model on the testing set with typology") +
  theme_bw()
#MAE chart
ggplot(data = val_preds.eighth %>% 
         dplyr::select(model, MAE) %>% 
         distinct(), 
       aes(x = model, y = MAE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = round(MAE, 1))) +
  labs(title = "1/8 mi Buffer, MAE of each model on the testing set with typology") +
  theme_bw()
#RMSE
ggplot(data = val_preds.eighth %>% 
         dplyr::select(model, RMSE) %>% 
         distinct(), 
       aes(x = model, y = RMSE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = round(RMSE, 1))) +
  labs(title = "1/8 mi Buffer, RMSE of each model on the testing set with typology") +
  theme_bw()
#Predicted vs Observed
ggplot(val_preds.eighth, aes(x = .pred, y = mean_on, group = model)) +
  geom_point() +
  geom_abline(linetype = "dashed", color = "red") +
  geom_smooth(method = "lm", color = "blue") +
  coord_equal() +
  facet_wrap(~model, nrow = 2) +
  labs(title = "1/8 Mile: Predicted vs Observed on the testing set", subtitle = "blue line is predicted value") +
  theme_bw()

val_MAPE_by_typology.eighth <- val_preds.eighth %>% 
  group_by(typology, model) %>% 
  summarise(RMSE = yardstick::rmse_vec(mean_on, .pred),
            MAE  = yardstick::mae_vec(mean_on, .pred),
            MAPE = yardstick::mape_vec(mean_on, .pred)) %>% 
  ungroup() 
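The per-typology grouping above can be sketched in base R with hypothetical values, showing how grouped MAPE isolates a poorly predicted typology:

```r
# Hypothetical predictions for stops in two typologies
df <- data.frame(typology = c("core", "core", "suburb", "suburb"),
                 truth    = c(100, 200, 50, 80),
                 pred     = c(110, 180, 40, 100))
# Group absolute percentage errors by typology, then average within each group
mape_by_group <- tapply(abs(df$truth - df$pred) / df$truth * 100,
                        df$typology, mean)
round(mape_by_group, 1)  # core 10.0, suburb 22.5
```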
#Barchart of the MAE in each neighborhood
plotTheme <- function(base_size = 10) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 20,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=20),
    axis.title = element_text(size=20),
    axis.text = element_text(size=15),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic", size= 20),
    legend.text = element_text(colour = "black", face = "italic",size = 20),
    strip.text.x = element_text(size = 15)
  )
}
palette4 <- c("#eff3ff", "#bdd7e7","#6baed6","#2171b5")


as.data.frame(val_MAPE_by_typology.eighth) %>%
  dplyr::select(typology, model, MAE) %>%
  ggplot(aes(typology, MAE)) + 
  geom_bar(aes(fill = model), position = "dodge", stat = "identity") +
  ylim(0, 300) +
  scale_fill_manual(values = palette4) +
  facet_wrap(~typology, scales = "free", ncol = 4) +
  labs(title = "1/8 mile: Mean Absolute Errors by model specification") +
  plotTheme()

####################################Kitchen-sink model using selected variables

library(dplyr)
# install.packages("mltools") # uncomment on first run
library(mltools)
sce0 <- sce %>% drop_na()
sum(is.na(sce0)) # confirm no NAs remain after dropping

sce0 <- plyr::join(sce0, typology, type= "left")

library(data.table)
# One-hot encode every route-category variable in a single call
# instead of one call per column
cat_cols <- c("SN_cat", "Crosstown_cat", "Express_cat", "Local_cat",
              "Flyer_cat", "NightOwl_cat", "HighFreq_cat", "InOut_cat",
              "Clockwise_cat", "utshuttle_cat", "Special_cat")
sce0 <- as.data.table(sce0) %>%
  one_hot(cols = cat_cols, dropCols = TRUE)
names(sce0)
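For reference, the effect of one_hot() can be sketched with base R's model.matrix() on a hypothetical toy column:

```r
# Toy factor column: each level becomes its own 0/1 indicator
toy <- data.frame(SN_cat = factor(c("yes", "no", "yes")))
onehot <- model.matrix(~ SN_cat - 1, data = toy)  # "-1" drops the intercept
colnames(onehot)  # "SN_catno" "SN_catyes"
onehot[, "SN_catyes"]  # 1 0 1
```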

sce0 <- sce0 %>%dplyr::select(-STOP_ID)
data_split.sce0 <- rsample::initial_split(sce0, strata = "mean_on", prop = 0.75)

bus_train.sce0 <- rsample::training(data_split.sce0)
bus_test.sce0  <- rsample::testing(data_split.sce0)
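initial_split(prop = 0.75) reserves a random 75% of stops for training; the same idea in base R, with a hypothetical 100-row dataset:

```r
set.seed(1)
n <- 100                                  # hypothetical number of stops
train_idx <- sample(n, size = floor(0.75 * n))
test_idx  <- setdiff(seq_len(n), train_idx)
c(train = length(train_idx), test = length(test_idx))  # 75 / 25
```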
names(bus_train.quarter)

cv_splits_geo.sce0 <- rsample::group_vfold_cv(bus_train.sce0, strata = "mean_on", group = "typology")
print(cv_splits_geo.sce0)
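group_vfold_cv() holds out entire typologies, so stops from one typology never appear on both sides of a resample. A base-R sketch of the leave-one-group-out idea with hypothetical groups:

```r
# Hypothetical typology labels for five stops
groups <- c("core", "core", "suburb", "suburb", "campus")
# One fold per typology: train on the others, assess on the held-out group
folds <- lapply(unique(groups), function(g) {
  list(analysis   = which(groups != g),
       assessment = which(groups == g))
})
length(folds)  # 3 folds, one per typology
```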

model_rec.sce0 <- recipe(mean_on ~ ., data = bus_train.sce0) %>% # "." uses every variable in the training set as a predictor
  update_role(typology, new_role = "typology") %>% # keep typology out of the predictor set (used only for grouping)
  step_other(typology, threshold = 0.005) %>%
  step_dummy(all_nominal(), -typology) %>%
  step_log(mean_on) %>% 
  step_zv(all_predictors()) %>%
  step_center(all_predictors()) %>% # all_predictors() already excludes the outcome mean_on
  step_scale(all_predictors()) # put predictors on a standard-deviation scale
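step_center() and step_scale() jointly standardize each predictor; the equivalent base-R arithmetic on hypothetical values:

```r
x <- c(2, 4, 6)                 # hypothetical predictor values
z <- (x - mean(x)) / sd(x)      # center, then put on a standard-deviation scale
z  # -1 0 1
```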


#Create workflow

rf_wf.sce0 <-
  workflow() %>% 
  add_recipe(model_rec.sce0) %>% 
  add_model(rf_plan)

# Fit the model to the workflow and tune; metrics changed from rmse + rsq to rsq only
rf_tuned.sce0 <- rf_wf.sce0 %>%
  tune::tune_grid(.,
                  resamples = cv_splits_geo.sce0,
                  grid      = rf_grid,
                  control   = control,
                  metrics   = metric_set(rsq))

show_best(rf_tuned.sce0, metric = "rsq", n = 15)



rf_best_params.sce0     <- select_best(rf_tuned.sce0, metric = "rsq")

#Final workflow
rf_best_wf.sce0     <- finalize_workflow(rf_wf.sce0, rf_best_params.sce0)

rf_val_fit_geo.sce0 <- rf_best_wf.sce0 %>% 
  last_fit(split     = data_split.sce0,
           control   = control,
           metrics   = metric_set(rsq))

####################################Model Validation
# Pull best hyperparam preds from out-of-fold predictions
rf_best_OOF_preds.sce0 <- collect_predictions(rf_tuned.sce0) %>% 
  filter(mtry  == rf_best_params.sce0$mtry[1] & min_n == rf_best_params.sce0$min_n[1])
# collect validation set predictions from last_fit model
rf_val_pred_geo.sce0     <- collect_predictions(rf_val_fit_geo.sce0)
# Aggregate predictions from Validation set
library(tidyverse)
library(yardstick)
rf_best_OOF_preds.sce0 <- rf_best_OOF_preds.sce0 %>% dplyr::select(-min_n, -mtry)
val_preds.sce0 <- rbind(data.frame(rf_val_pred_geo.sce0), data.frame(rf_best_OOF_preds.sce0) )%>% 
  left_join(., sce0 %>% 
              rowid_to_column(var = ".row") %>% 
              dplyr::select(typology, .row), 
            by = ".row") %>% 
  dplyr::group_by(typology) %>%
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         MAPE = yardstick::mape_vec(mean_on, .pred))%>%
  ungroup()

val_MAPE_by_typology.sce0 <- val_preds.sce0 %>% 
  group_by(typology) %>% 
  summarise(RMSE = yardstick::rmse_vec(mean_on, .pred),
            MAE  = yardstick::mae_vec(mean_on, .pred),
            MAPE = yardstick::mape_vec(mean_on, .pred)) %>% 
  ungroup() 
#Barchart of the MAE in each neighborhood
# plotTheme() and palette4, defined above, are reused here


as.data.frame(val_MAPE_by_typology.sce0) %>%
  dplyr::select(typology, MAPE) %>%
  ggplot(aes(typology, MAPE)) + 
  geom_bar(aes(fill = typology), position = "dodge", stat = "identity") +
  ylim(0, 120) +
  scale_fill_manual(values = palette4) +
  facet_wrap(~typology, scales = "free", ncol = 4) +
  labs(title = "MAPE of the random forest model") +
  plotTheme()